Previous changeset 30:4188853b940b (2013-07-26) Next changeset 32:03c22b722882 (2013-09-20) |
Commit message:
Update to Miller Lab devshed revision 3c4110ffacc3 |
modified:
README add_fst_column.xml aggregate_gd_indivs.xml average_fst.xml coverage_distributions.xml diversity_pi.py diversity_pi.xml dpmix.py dpmix.xml dpmix_plot.py draw_variants.py draw_variants.xml filter_gd_snp.xml find_intervals.xml nucleotide_diversity_pi.xml pathway_image.xml pca.xml phylogenetic_tree.xml population_structure.xml prepare_population_structure.xml rank_pathways.xml rank_terms.xml reorder.xml |
added:
assignment_of_optimal_breeding_pairs.py assignment_of_optimal_breeding_pairs.xml cluster_kegg.xml cluster_onConnctdComps.py discover_familial_relationships.py discover_familial_relationships.xml gd_snp2vcf.pl gd_snp2vcf.xml inbreeding_and_kinship.py inbreeding_and_kinship.xml make_phylip.py make_phylip.xml offspring_heterozygosity.py offspring_heterozygosity.xml offspring_heterozygosity_pedigree.py offspring_heterozygosity_pedigree.xml raxml.py raxml.xml static/images/cluster_kegg_formula.png static/images/gd_coverage.png tool_dependencies.xml |
removed:
BeautifulSoup.py genome_diversity/Makefile genome_diversity/bin/gd_ploteig genome_diversity/bin/varplot genome_diversity/src/Fst_ave.c genome_diversity/src/Fst_column.c genome_diversity/src/Fst_lib.c genome_diversity/src/Fst_lib.h genome_diversity/src/Huang.c genome_diversity/src/Huang.h genome_diversity/src/Makefile genome_diversity/src/admix_prep.c genome_diversity/src/aggregate.c genome_diversity/src/coords2admix.c genome_diversity/src/coverage.c genome_diversity/src/dist_mat.c genome_diversity/src/dpmix.c genome_diversity/src/eval2pct.c genome_diversity/src/filter_snps.c genome_diversity/src/get_pi.c genome_diversity/src/lib.c genome_diversity/src/lib.h genome_diversity/src/mito_lib.c genome_diversity/src/mito_lib.h genome_diversity/src/mk_Ji.c genome_diversity/src/mt_pi.c genome_diversity/src/sweep.c |
b |
diff -r 4188853b940b -r a631c2f6d913 BeautifulSoup.py --- a/BeautifulSoup.py Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
b'@@ -1,2014 +0,0 @@\n-"""Beautiful Soup\n-Elixir and Tonic\n-"The Screen-Scraper\'s Friend"\n-http://www.crummy.com/software/BeautifulSoup/\n-\n-Beautiful Soup parses a (possibly invalid) XML or HTML document into a\n-tree representation. It provides methods and Pythonic idioms that make\n-it easy to navigate, search, and modify the tree.\n-\n-A well-formed XML/HTML document yields a well-formed data\n-structure. An ill-formed XML/HTML document yields a correspondingly\n-ill-formed data structure. If your document is only locally\n-well-formed, you can use this library to find and process the\n-well-formed part of it.\n-\n-Beautiful Soup works with Python 2.2 and up. It has no external\n-dependencies, but you\'ll have more success at converting data to UTF-8\n-if you also install these three packages:\n-\n-* chardet, for auto-detecting character encodings\n- http://chardet.feedparser.org/\n-* cjkcodecs and iconv_codec, which add more encodings to the ones supported\n- by stock Python.\n- http://cjkpython.i18n.org/\n-\n-Beautiful Soup defines classes for two main parsing strategies:\n-\n- * BeautifulStoneSoup, for parsing XML, SGML, or your domain-specific\n- language that kind of looks like XML.\n-\n- * BeautifulSoup, for parsing run-of-the-mill HTML code, be it valid\n- or invalid. This class has web browser-like heuristics for\n- obtaining a sensible parse tree in the face of common HTML errors.\n-\n-Beautiful Soup also defines a class (UnicodeDammit) for autodetecting\n-the encoding of an HTML or XML document, and converting it to\n-Unicode. Much of this code is taken from Mark Pilgrim\'s Universal Feed Parser.\n-\n-For more than you ever wanted to know about Beautiful Soup, see the\n-documentation:\n-http://www.crummy.com/software/BeautifulSoup/documentation.html\n-\n-Here, have some legalese:\n-\n-Copyright (c) 2004-2010, Leonard Richardson\n-\n-All rights reserved.\n-\n-Redistribution and use in source and binary forms, with or without\n-modification, are permitted provided that the following conditions are\n-met:\n-\n- * Redistributions of source code must retain the above copyright\n- notice, this list of conditions and the following disclaimer.\n-\n- * Redistributions in binary form must reproduce the above\n- copyright notice, this list of conditions and the following\n- disclaimer in the documentation and/or other materials provided\n- with the distribution.\n-\n- * Neither the name of the the Beautiful Soup Consortium and All\n- Night Kosher Bakery nor the names of its contributors may be\n- used to endorse or promote products derived from this software\n- without specific prior written permission.\n-\n-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS\n-"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT\n-LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR\n-A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR\n-CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,\n-EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,\n-PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR\n-PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF\n-LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING\n-NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\n-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE, DAMMIT.\n-\n-"""\n-from __future__ import generators\n-\n-__author__ = "Leonard Richardson (leonardr@segfault.org)"\n-__version__ = "3.2.0"\n-__copyright__ = "Copyright (c) 2004-2010 Leonard Richardson"\n-__license__ = "New-style BSD"\n-\n-from sgmllib import SGMLParser, SGMLParseError\n-import codecs\n-import markupbase\n-import types\n-import re\n-import sgmllib\n-try:\n- from htmlentitydefs import name2codepoint\n-except ImportError:\n- name2codepoint = {}\n-try:\n- set\n-except NameError:\n- from sets import Set as set\n-\n-#These hacks make Beautiful Soup able to parse XML with namespaces\n-sgmllib.tagfind = re.co'..b' \'utf-32\', \'utf_16\', \'utf_32\',\n- \'utf16\', \'u16\')):\n- xml_encoding = sniffed_xml_encoding\n- return xml_data, xml_encoding, sniffed_xml_encoding\n-\n-\n- def find_codec(self, charset):\n- return self._codec(self.CHARSET_ALIASES.get(charset, charset)) \\\n- or (charset and self._codec(charset.replace("-", ""))) \\\n- or (charset and self._codec(charset.replace("-", "_"))) \\\n- or charset\n-\n- def _codec(self, charset):\n- if not charset: return charset\n- codec = None\n- try:\n- codecs.lookup(charset)\n- codec = charset\n- except (LookupError, ValueError):\n- pass\n- return codec\n-\n- EBCDIC_TO_ASCII_MAP = None\n- def _ebcdic_to_ascii(self, s):\n- c = self.__class__\n- if not c.EBCDIC_TO_ASCII_MAP:\n- emap = (0,1,2,3,156,9,134,127,151,141,142,11,12,13,14,15,\n- 16,17,18,19,157,133,8,135,24,25,146,143,28,29,30,31,\n- 128,129,130,131,132,10,23,27,136,137,138,139,140,5,6,7,\n- 144,145,22,147,148,149,150,4,152,153,154,155,20,21,158,26,\n- 32,160,161,162,163,164,165,166,167,168,91,46,60,40,43,33,\n- 38,169,170,171,172,173,174,175,176,177,93,36,42,41,59,94,\n- 45,47,178,179,180,181,182,183,184,185,124,44,37,95,62,63,\n- 186,187,188,189,190,191,192,193,194,96,58,35,64,39,61,34,\n- 195,97,98,99,100,101,102,103,104,105,196,197,198,199,200,\n- 201,202,106,107,108,109,110,111,112,113,114,203,204,205,\n- 206,207,208,209,126,115,116,117,118,119,120,121,122,210,\n- 211,212,213,214,215,216,217,218,219,220,221,222,223,224,\n- 225,226,227,228,229,230,231,123,65,66,67,68,69,70,71,72,\n- 73,232,233,234,235,236,237,125,74,75,76,77,78,79,80,81,\n- 82,238,239,240,241,242,243,92,159,83,84,85,86,87,88,89,\n- 90,244,245,246,247,248,249,48,49,50,51,52,53,54,55,56,57,\n- 250,251,252,253,254,255)\n- import string\n- c.EBCDIC_TO_ASCII_MAP = string.maketrans( \\\n- \'\'.join(map(chr, range(256))), \'\'.join(map(chr, emap)))\n- return s.translate(c.EBCDIC_TO_ASCII_MAP)\n-\n- MS_CHARS = { \'\\x80\' : (\'euro\', \'20AC\'),\n- \'\\x81\' : \' \',\n- \'\\x82\' : (\'sbquo\', \'201A\'),\n- \'\\x83\' : (\'fnof\', \'192\'),\n- \'\\x84\' : (\'bdquo\', \'201E\'),\n- \'\\x85\' : (\'hellip\', \'2026\'),\n- \'\\x86\' : (\'dagger\', \'2020\'),\n- \'\\x87\' : (\'Dagger\', \'2021\'),\n- \'\\x88\' : (\'circ\', \'2C6\'),\n- \'\\x89\' : (\'permil\', \'2030\'),\n- \'\\x8A\' : (\'Scaron\', \'160\'),\n- \'\\x8B\' : (\'lsaquo\', \'2039\'),\n- \'\\x8C\' : (\'OElig\', \'152\'),\n- \'\\x8D\' : \'?\',\n- \'\\x8E\' : (\'#x17D\', \'17D\'),\n- \'\\x8F\' : \'?\',\n- \'\\x90\' : \'?\',\n- \'\\x91\' : (\'lsquo\', \'2018\'),\n- \'\\x92\' : (\'rsquo\', \'2019\'),\n- \'\\x93\' : (\'ldquo\', \'201C\'),\n- \'\\x94\' : (\'rdquo\', \'201D\'),\n- \'\\x95\' : (\'bull\', \'2022\'),\n- \'\\x96\' : (\'ndash\', \'2013\'),\n- \'\\x97\' : (\'mdash\', \'2014\'),\n- \'\\x98\' : (\'tilde\', \'2DC\'),\n- \'\\x99\' : (\'trade\', \'2122\'),\n- \'\\x9a\' : (\'scaron\', \'161\'),\n- \'\\x9b\' : (\'rsaquo\', \'203A\'),\n- \'\\x9c\' : (\'oelig\', \'153\'),\n- \'\\x9d\' : \'?\',\n- \'\\x9e\' : (\'#x17E\', \'17E\'),\n- \'\\x9f\' : (\'Yuml\', \'\'),}\n-\n-#######################################################################\n-\n-\n-#By default, act as an HTML pretty-printer.\n-if __name__ == \'__main__\':\n- import sys\n- soup = BeautifulSoup(sys.stdin)\n- print soup.prettify()\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 README --- a/README Fri Jul 26 12:51:13 2013 -0400 +++ b/README Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -1,18 +1,3 @@ -Source code for the executables needed by these tools can be found in -the genome_diversity directory. - -Additionally, you'll need the following python modules: - matplotlib (we used version 1.1.0) http://pypi.python.org/packages/source/m/matplotlib/ - mechanize (we used version 0.2.5) http://pypi.python.org/packages/source/m/mechanize/ - networkx (we used version 1.6) http://pypi.python.org/packages/source/n/networkx/ - fisher (we used version 0.1.4) http://pypi.python.org/packages/source/f/fisher/ - -And the following software: +The Genome Diversity tools require the following software: ADMIXTURE (we used version 1.22) http://www.genetics.ucla.edu/software/admixture/ - EIGENSOFT (we used version 3.0) http://genepath.med.harvard.edu/~reich/Software.htm - PHAST (we used version 1.2.1) http://compgen.bscb.cornell.edu/phast/ - QuickTree (we used version 1.1) http://www.sanger.ac.uk/resources/software/quicktree/ - -Images used in the tools' documentation are located in the static/images -directory. Please copy these to the static/images directory in your -Galaxy installation. + KING (we used version 1.5) http://people.virginia.edu/~wc9c/KING/ |
b |
diff -r 4188853b940b -r a631c2f6d913 add_fst_column.xml --- a/add_fst_column.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/add_fst_column.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -77,6 +77,10 @@ <data name="output" format="input" format_source="input" metadata_source="input" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 aggregate_gd_indivs.xml --- a/aggregate_gd_indivs.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/aggregate_gd_indivs.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -43,6 +43,10 @@ <data name="output" format="input" format_source="input" metadata_source="input" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 assignment_of_optimal_breeding_pairs.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/assignment_of_optimal_breeding_pairs.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,185 @@ +#!/usr/bin/env python2.6 + +import sys +import munkres +import random + +class Vertex(object): + def __init__(self, name): + self.name = name + self.neighbors = {} + self.color = 0 + self.explored = False + + def add_neighbor(self, neighbor, weight=0.0): + if neighbor in self.neighbors: + if self.neighbors[neighbor] != weight: + die('multiple edges not supported') + else: + self.neighbors[neighbor] = weight + +class Graph(object): + def __init__(self): + self.vertex_list = {} + self.vertices = 0 + self.max_weight = 0.0 + + def add_vertex(self, name): + if name not in self.vertex_list: + self.vertex_list[name] = Vertex(name) + self.vertices += 1 + return self.vertex_list[name] + + def add_edge(self, name1, name2, weight): + vertex1 = self.add_vertex(name1) + vertex2 = self.add_vertex(name2) + vertex1.add_neighbor(vertex2, weight) + vertex2.add_neighbor(vertex1, weight) + self.max_weight = max(self.max_weight, weight) + + def from_edge_file(self, filename): + fh = try_open(filename) + line_number = 0 + for line in fh: + line_number += 1 + line = line.rstrip('\r\n') + elems = line.split() + if len(elems) < 3: + die('too few columns on line {0} of {1}:\n{2}'.format(line_number, filename, line)) + name1 = elems[0] + name2 = elems[1] + weight = float_value(elems[2]) + if weight is None: + die('invalid weight on line {0} of {1}:\n{2}'.format(line_number, filename, line)) + self.add_edge(name1, name2, weight) + fh.close() + + def bipartite_partition(self): + vertices_left = self.vertex_list.values() + + while vertices_left: + fifo = [vertices_left[0]] + while fifo: + vertex = fifo.pop(0) + if not vertex.explored: + vertex.explored = True + vertices_left.remove(vertex) + + if vertex.color == 0: + vertex.color = 1 + neighbor_color = 2 + elif vertex.color == 1: + neighbor_color = 2 + elif vertex.color == 2: + neighbor_color = 1 + + for neighbor in vertex.neighbors: + if neighbor.color == 0: + neighbor.color = neighbor_color + elif neighbor.color != neighbor_color: + return None, None + fifo.append(neighbor) + + c1 = [] + c2 = [] + + for vertex in self.vertex_list.values(): + if vertex.color == 1: + c1.append(vertex) + elif vertex.color == 2: + c2.append(vertex) + + return c1, c2 + +def try_open(*args): + try: + return open(*args) + except IOError: + die('Failed opening file: {0}'.format(args[0])) + +def float_value(token): + try: + return float(token) + except ValueError: + return None + +def die(message): + print >> sys.stderr, message + sys.exit(1) + +def main(input, randomizations, output): + graph = Graph() + graph.from_edge_file(input) + c1, c2 = graph.bipartite_partition() + + if c1 is None: + die('Graph is not bipartite') + + if len(c1) + len(c2) != graph.vertices: + die('Bipartite partition failed: {0} + {1} != {2}'.format(len(c1), len(c2), graph.vertices)) + + with open(output, 'w') as ofh: + a1 = optimal_assignment(c1, c2, graph.max_weight) + optimal_total_weight = 0.0 + for a in a1: + optimal_total_weight += a[0].neighbors[a[1]] + + print >> ofh, 'optimal average {0:.3f}'.format(optimal_total_weight / len(a1)) + + if randomizations > 0: + random_total_count = 0 + random_total_weight = 0.0 + for i in range(randomizations): + a2 = random_assignment(c1, c2) + random_total_count += len(a2) + for a in a2: + random_total_weight += a[0].neighbors[a[1]] + print >> ofh, 'random average {0:.3f}'.format(random_total_weight / random_total_count) + + + for a in a1: + print >> ofh, '\t'.join([a[0].name, a[1].name]) + +def optimal_assignment(c1, c2, max_weight): + matrix = [] + assignment = [] + + for v1 in c1: + row = [] + for v2 in c2: + row.append(max_weight + 1.0 - v1.neighbors[v2]) + matrix.append(row) + + m = munkres.Munkres() + indexes = m.compute(matrix) + for row, column in indexes: + assignment.append([c1[row], c2[column]]) + + return assignment + +def random_assignment(c1, c2): + assignment = [] + + ## note, this assumes that graph is complete bipartite + ## this needs to be fixed + c1_len = len(c1) + c2_len = len(c2) + idx_list = list(range(max(c1_len, c2_len))) + random.shuffle(idx_list) + + if c1_len <= c2_len: + for i, v1 in enumerate(c1): + assignment.append([v1, c2[idx_list[i]]]) + else: + for i, v1 in enumerate(c2): + assignment.append([v1, c1[idx_list[i]]]) + + return assignment + +################################################################################ + +if len(sys.argv) != 4: + die('Usage') + +input, randomizations, output = sys.argv[1:] +main(input, int(randomizations), output) |
b |
diff -r 4188853b940b -r a631c2f6d913 assignment_of_optimal_breeding_pairs.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/assignment_of_optimal_breeding_pairs.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,54 @@ +<tool id="gd_assignment_of_optimal_breeding_pairs" name="Matings" version="1.0.0"> + <description>: Assignment of optimal breeding pairs</description> + + <command interpreter="python"> + assignment_of_optimal_breeding_pairs.py '$input' '$randomizations' '$output' + </command> + + <inputs> + <param name="input" type="data" format="txt" label="Pairs dataset" /> + <param name="randomizations" type="integer" min="0" value="0" label="Randomizations" /> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="1.0.5.4">munkres</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input and output datasets are in text_ format. + +.. _text: ./static/formatHelp.html#text + +The pairs dataset consists of lines of the form:: + + name1 name2 prob + +as generated by either of the "Offspring estimated heterozygosity" tools. + +----- + +**What it does** + +The user supplies the offspring estimated heterozygosity for every +potential breeding pair, i.e., the expected fraction of autosomal SNPs +for which an offspring is heterozygous. The tool assigns breeding +pairs to maximize the average estimated heterozygosity of the offspring. +Optionally, the user can specify a number of random assigned pairings, +for which the program reports the average estimated heterozygosity +of the offspring; this gives a comparison of the optimal and average +heterozygosity resulting from an assignment of breeding pairs. + + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 average_fst.xml --- a/average_fst.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/average_fst.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -85,6 +85,10 @@ <data name="output" format="txt" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 cluster_kegg.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/cluster_kegg.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,46 @@ +<tool id="gd_cluster_kegg" name="Cluster KEGG" version="1.0.0"> + <description>: Group gene categories connected by shared genes</description> + + <command interpreter="python"> + #set $ensembltcolmn_arg = int(str($ensembltcolmn)) - 1 + cluster_onConnctdComps.py '--input=$input' '--input_columns=${input.dataset.metadata.columns}' '--outfile=$output' '--threshold=$threshold' '--ENSEMBLTcolmn=$ensembltcolmn_arg' '--classClmns=$classclmns' + </command> + + <inputs> + <param name="input" type="data" format="tabular" label="Input dataset" /> + <param name="ensembltcolmn" type="data_column" data_ref="input" numerical="false" label="Column with the ENSEMBL code in the Input dataset" /> + <param name="threshold" type="float" value="90" min="0" max="100" label="Threshold to disconnect the nodes" /> + <param name="classclmns" size="10" type="text" value="c1,c2" label="Gene category columns"/> + + </inputs> + + <outputs> + <data name="output" format="tabular" /> + </outputs> + + <requirements> + <requirement type="package" version="1.8.1">networkx</requirement> + </requirements> + + <help> +**What it does** + +The program builds a network of gene categories connected by shared +genes. The edges of this network are weighted based on the number of +genes that each node shares. The clustering coefficient, c\ :sub:`u`\ , is then calculated for each node using the formula: + +.. image:: $PATH_TO_IMAGES/cluster_kegg_formula.png + +| + +where deg(u) is the degree of u and edge weights, w\ :sub:`uv`\ , +are normalized by the maximum weight in the network. The cluster +coefficients are then filtered by our program based on threshold (that +could be a percentile or a value choose by the user) and all the nodes +with a cluster coefficient lower than this threshold are deleted from +the network. Finally, the program reports each connected component as +a cluster of gene classifications. With our program a lower number of +gene categories is obtained, but the results are easier to interpret as +they exclude genes present in many gene groups. + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 cluster_onConnctdComps.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/cluster_onConnctdComps.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,223 @@ +#!/usr/bin/env python +# -*- coding: utf-8 -*- +# +# Cluster_GOKEGG.py +# +# Copyright 2013 Oscar Reina <oscar@niska.bx.psu.edu> +# +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 2 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write to the Free Software +# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, +# MA 02110-1301, USA. + +import argparse +import os +from networkx import connected_components,Graph,clustering +from numpy import percentile +from decimal import Decimal,getcontext +from itertools import permutations,combinations +import sys + +def rtrnClustrsOnCltrCoff(dNodesWeightMin,threshold,perctile=True): + """ + From a file with three columns: nodeA, nodeB and a score, it returns + the strong and weak connected components produced when the edges + below the percentage threshold (or value) are excluded. + """ + #~ + Gmin = Graph() + for nodeA,nodeB in dNodesWeightMin: + wMin=dNodesWeightMin[nodeA,nodeB] + Gmin.add_edge(nodeA,nodeB,weight=wMin) + #~ + clstrCoffcMin=clustering(Gmin,weight='weight') + #~ + if perctile: + umbralMin=percentile(clstrCoffcMin.values(),threshold) + else: + umbralMin=threshold + #~ + GminNdsRmv=[x for x in clstrCoffcMin if clstrCoffcMin[x]<umbralMin] + #~ + Gmin.remove_nodes_from(GminNdsRmv) + #~ + dTermCmptNumbWkMin=rtrndata(Gmin) + #~ + salelClustr=[] + srtdterms=sorted(dTermCmptNumbWkMin.keys()) + for echTerm in srtdterms: + try: + MinT=dTermCmptNumbWkMin[echTerm] + except: + MinT='-' + salelClustr.append('\t'.join([echTerm,MinT])) + #~ + return salelClustr + +def rtrndata(G): + """ + returna list of terms and its clustering, as well as clusters from + a networkx formatted file. + """ + #~ + cntCompnts=0 + dTermCmptNumbWk={} + for echCompnt in connected_components(G): + cntCompnts+=1 + #print '.'.join(echCompnt) + for echTerm in echCompnt: + dTermCmptNumbWk[echTerm]=str(cntCompnts) + #~ + return dTermCmptNumbWk + +def rtrnCATcENSEMBLc(inCATfile,classClmns,ENSEMBLTcolmn,nonHdr=True): + """ + return a dictionary of all the categories in an input file with + a set of genes. Takes as input a file with categories an genes. + """ + dCAT={} + dENSEMBLTCAT={} + for eachl in open(inCATfile,'r'): + if nonHdr and eachl.strip(): + ENSEMBLT=eachl.splitlines()[0].split('\t')[ENSEMBLTcolmn] + sCAT=set() + for CATcolmn in classClmns: + sCAT.update(set([x for x in eachl.splitlines()[0].split('\t')[CATcolmn].split('.')])) + sCAT=sCAT.difference(set(['','U','N'])) + if len(sCAT)>0: + dENSEMBLTCAT[ENSEMBLT]=sCAT + for CAT in sCAT: + try: + dCAT[CAT].add(ENSEMBLT) + except: + dCAT[CAT]=set([ENSEMBLT]) + nonHdr=True + #~ + dCAT=dict([(x,len(dCAT[x])) for x in dCAT.keys()]) + #~ + return dCAT,dENSEMBLTCAT + + +def calcDistance(sCAT1,sCAT2): + """ + takes as input two set of genesin different categories and returns + a value 1-percentage of gene shared cat1->cat2, and cat2->cat1. + """ + getcontext().prec=5 + lgensS1=Decimal(len(sCAT1)) + lgensS2=Decimal(len(sCAT2)) + shrdGns=sCAT1.intersection(sCAT2) + lenshrdGns=len(shrdGns) + #~ + dC1C2=1-(lenshrdGns/lgensS1) + dC2C1=1-(lenshrdGns/lgensS2) + #~ + return dC1C2,dC2C1 + +def rtnPrwsdtncs(dCAT,dENSEMBLTCAT): + """ + return a mcl formated pairwise distances from a list of categories + """ + #~ + getcontext().prec=5 + dCATdst={} + lENSEMBL=dENSEMBLTCAT.keys() + l=len(lENSEMBL) + c=0 + for ENSEMBL in lENSEMBL: + c+=1 + lCAT=dENSEMBLTCAT.pop(ENSEMBL) + for CAT1,CAT2 in combinations(lCAT, 2): + try: + dCATdst[CAT1,CAT2]+=1 + except: + dCATdst[CAT1,CAT2]=1 + try: + dCATdst[CAT2,CAT1]+=1 + except: + dCATdst[CAT2,CAT1]=1 + #~ + dNodesWeightMin={} + l=len(dCATdst) + for CAT1,CAT2 in dCATdst.keys(): + shrdGns=dCATdst.pop((CAT1,CAT2)) + dC1C2=float(shrdGns) + nodeA,nodeB=sorted([CAT1,CAT2]) + try: + cscor=dNodesWeightMin[nodeA,nodeB] + if cscor>=dC1C2: + dNodesWeightMin[nodeA,nodeB]=dC1C2 + except: + dNodesWeightMin[nodeA,nodeB]=dC1C2 + # + return dNodesWeightMin + +def parse_class_columns(val, max_col): + int_list = [] + + for elem in [x.strip() for x in val.split(',')]: + if elem[0].lower() != 'c': + print >> sys.stderr, "bad column format:", elem + sys.exit(1) + + int_val = as_int(elem[1:]) + + if int_val is None: + print >> sys.stderr, "bad column format:", elem + sys.exit(1) + elif not 1 <= int_val <= max_col: + print >> sys.stderr, "column out of range:", elem + sys.exit(1) + + int_list.append(int_val - 1) + + return int_list + +def as_int(val): + try: + return int(val) + except ValueError: + return None + else: + raise + +def main(): + """ + """ + #~ bpython cluster_onConnctdComps.py --input=../conctFinal_CEU.tsv --outfile=../borrar.txt --threshold=90 --ENSEMBLTcolmn=1 --classClmns='20 22' + parser = argparse.ArgumentParser(description='Returns the count of genes in ...') + parser.add_argument('--input',metavar='input TXT file',type=str,help='the input file with the table in txt format.',required=True) + parser.add_argument('--input_columns',metavar='input INT value',type=int,help='the number of columns in the input file.',required=True) + parser.add_argument('--outfile',metavar='input TXT file',type=str,help='the output file with the connected components.',required=True) + parser.add_argument('--threshold',metavar='input FLOAT value',type=float,help='the threshold to disconnect the nodes.',required=True) + parser.add_argument('--ENSEMBLTcolmn',metavar='input INT file',type=int,help='the column with the ENSEMBLE code in the input.',required=True) + parser.add_argument('--classClmns',metavar='input STR value',type=str,help='the list of columns with the gene categories separated by space.',required=True) + args = parser.parse_args() + infile = args.input + threshold = args.threshold + outfile = args.outfile + ENSEMBLTcolmn = args.ENSEMBLTcolmn + classClmns = parse_class_columns(args.classClmns, args.input_columns) + #~ + dCAT,dENSEMBLTCAT=rtrnCATcENSEMBLc(infile,classClmns,ENSEMBLTcolmn) + dNodesWeightMin=rtnPrwsdtncs(dCAT,dENSEMBLTCAT) + salelClustr=rtrnClustrsOnCltrCoff(dNodesWeightMin,threshold) + #~ + with open(outfile, 'w') as salef: + print >> salef, '\n'.join(salelClustr) + #~ + #~ + +if __name__ == '__main__': + main() + |
b |
diff -r 4188853b940b -r a631c2f6d913 coverage_distributions.xml --- a/coverage_distributions.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/coverage_distributions.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -57,6 +57,10 @@ <data name="output" format="html" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> @@ -121,7 +125,7 @@ graphical output: -.. image:: ${static_path}/images/gd_coverage.png +.. image:: $PATH_TO_IMAGES/gd_coverage.png </help> </tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 discover_familial_relationships.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/discover_familial_relationships.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,100 @@ +#!/usr/bin/env python + +import sys +import gd_util + +from Population import Population + +################################################################################ + +if len(sys.argv) != 6: + gd_util.die('Usage') + +input, input_type, ind_arg, pop_input, output = sys.argv[1:] + +p_total = Population() +p_total.from_wrapped_dict(ind_arg) + +p1 = Population() +p1.from_population_file(pop_input) +if not p_total.is_superset(p1): + gd_util.die('There is an individual in the population that is not in the SNP table') + +################################################################################ + +prog = 'kinship_prep' + +args = [ prog ] +args.append(input) # a Galaxy SNP table +args.append(0) # required number of reads for each individual to use a SNP +args.append(0) # required genotype quality for each individual to use a SNP +args.append(0) # minimum spacing between SNPs on the same scaffold + +for tag in p1.tag_list(): + if input_type == 'gd_genotype': + column, name = tag.split(':') + tag = '{0}:{1}'.format(int(column) - 2, name) + args.append(tag) + +gd_util.run_program(prog, args) + +# kinship.map +# kinship.ped +# kinship.dat + +################################################################################ + +prog = 'king' + +args = [ prog ] +args.append('-d') +args.append('kinship.dat') +args.append('-p') +args.append('kinship.ped') +args.append('-m') +args.append('kinship.map') +args.append('--kinship') + +gd_util.run_program(prog, args) + +# king.kin + +################################################################################ + +valid_header = 'FID\tID1\tID2\tN_SNP\tZ0\tPhi\tHetHet\tIBS0\tKinship\tError\n' + +with open('king.kin') as fh: + header = fh.readline() + if header != valid_header: + gd_util.die('crap') + + with open(output, 'w') as ofh: + + for line in fh: + elems = line.split('\t') + if len(elems) != 10: + gd_util.die('crap') + + x = elems[1] + y = elems[2] + z = elems[8] + + f = float(z) + + message = '' + + if f > 0.354: + message = 'duplicate or MZ twin' + elif f >= 0.177: + message = '1st degree relatives' + elif f >= 0.0884: + message = '2nd degree relatives' + elif f >= 0.0442: + message = '3rd degree relatives' + + print >> ofh, '\t'.join([x, y, z, message]) + +################################################################################ + +sys.exit(0) + |
b |
diff -r 4188853b940b -r a631c2f6d913 discover_familial_relationships.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/discover_familial_relationships.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,67 @@ +<tool id="gd_discover_familial_relationships" name="Close relatives" version="1.0.0"> + <description>: Discover familial relationships</description> + + <command interpreter="python"> + #import json + #import base64 + #import zlib + #set $ind_names = $input.dataset.metadata.individual_names + #set $ind_colms = $input.dataset.metadata.individual_columns + #set $ind_dict = dict(zip($ind_names, $ind_colms)) + #set $ind_json = json.dumps($ind_dict, separators=(',',':')) + #set $ind_comp = zlib.compress($ind_json, 9) + #set $ind_arg = base64.b64encode($ind_comp) + discover_familial_relationships.py '$input' '$input.ext' '$ind_arg' '$pop_input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="Input dataset" /> + <param name="pop_input" type="data" format="gd_indivs" label="Individuals dataset" /> + </inputs> + + <outputs> + <data name="output" format="tabular" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in gd_snp_, gd_genotype_, and gd_indivs_ formats. +The output dataset is in tabular_ format. + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _tabular: ./static/formatHelp.html#tab + +----- + +**What it does** + +The user specifies a SNP table (either gd_snp or gd_genotype format) and +a set of individuals. The command runs the KING program (Manichaikul et +al., 2010) to look for pairs of distinct individuals in the specified +set that have a close family relationship. Putatively related pairs +are classified into five categories: + + 1. duplicate or MZ twin + 2. 1st degree relatives -- siblings (other than identical twins) or parent-child + 3. 2nd degree relatives -- e.g., half-siblings, grandparent-grandchild pair, individual-uncle/aunt pair + 4. 3rd degree relatives -- e.g., first cousins + 5. unrelated + +Reference: + +Manichaikul A, Mychaleckyj JC, Rich SS, Daly K, Sale M, Chen WM (2010) Robust relationship inference in genome-wide association studies. Bioinformatics 26: 2867-2873. + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 diversity_pi.py --- a/diversity_pi.py Fri Jul 26 12:51:13 2013 -0400 +++ b/diversity_pi.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -6,30 +6,55 @@ ################################################################################ -if len(sys.argv) != 7: - gd_util.die('Usage') - -snp_input, coverage_input, indiv_input, min_coverage, output, ind_arg = sys.argv[1:] +def load_pop(file, wrapped_dict): + if file == '/dev/null': + pop = None + else: + pop = Population() + pop.from_wrapped_dict(wrapped_dict) + return pop -p_total = Population() -p_total.from_wrapped_dict(ind_arg) - -p1 = Population() -p1.from_population_file(indiv_input) -if not p_total.is_superset(p1): - gd_util.die('There is an individual in the population individuals that is not in the SNP table') +def append_tags(the_list, p, p_type, val): + if p is None: + return + for tag in p.tag_list(): + column, name = tag.split(':') + if p_type == 'gd_genotype': + column = int(column) - 2 + the_list.append('{0}:{1}:{2}'.format(val, column, name)) ################################################################################ -prog = 'mt_pi' +if len(sys.argv) != 11: + gd_util.die('Usage') + +snp_input, snp_ext, snp_arg, cov_input, cov_ext, cov_arg, indiv_input, min_coverage, req_thresh, output = sys.argv[1:] + +p_snp = load_pop(snp_input, snp_arg) +p_cov = load_pop(cov_input, cov_arg) + +p_ind = Population() +p_ind.from_population_file(indiv_input) + +if not p_snp.is_superset(p_ind): + gd_util.die('There is an individual in the population individuals that is not in the SNP/Genotype table') + +if p_cov is not None and (not p_cov.is_superset(p_ind)): + gd_util.die('There is an individual in the population individuals that is not in the Coverage table') + +################################################################################ + +prog = 'mito_pi' args = [ prog ] args.append(snp_input) -args.append(coverage_input) +args.append(cov_input) args.append(min_coverage) +args.append(req_thresh) -for tag in p1.tag_list(): - args.append(tag) +append_tags(args, p_ind, 'gd_indivs', 0) +append_tags(args, p_snp, snp_ext, 1) +append_tags(args, p_cov, cov_ext, 2) with open(output, 'w') as fh: gd_util.run_program(prog, args, stdout=fh) |
b |
diff -r 4188853b940b -r a631c2f6d913 diversity_pi.xml --- a/diversity_pi.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/diversity_pi.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -1,30 +1,73 @@ -<tool id="gd_diversity_pi" name="Diversity" version="1.0.0"> - <description>&pi;</description> +<tool id="gd_diversity_pi" name="Diversity" version="1.1.0"> + <description>: pi, allowing for unsequenced intervals</description> <command interpreter="python"> #import json #import base64 #import zlib - #set $ind_names = $input.dataset.metadata.individual_names - #set $ind_colms = $input.dataset.metadata.individual_columns - #set $ind_dict = dict(zip($ind_names, $ind_colms)) - #set $ind_json = json.dumps($ind_dict, separators=(',',':')) - #set $ind_comp = zlib.compress($ind_json, 9) - #set $ind_arg = base64.b64encode($ind_comp) - diversity_pi.py '$input' '$coverage_input' '$indiv_input' '$min_coverage' '$output' '$ind_arg' + #set $snp_names = $input.dataset.metadata.individual_names + #set $snp_colms = $input.dataset.metadata.individual_columns + #set $snp_dict = dict(zip($snp_names, $snp_colms)) + #set $snp_json = json.dumps($snp_dict, separators=(',',':')) + #set $snp_comp = zlib.compress($snp_json, 9) + #set $snp_arg = base64.b64encode($snp_comp) + #if $use_cov.choice == '1' + #set $cov_file = $use_cov.cov_input + #set $cov_ext = $use_cov.cov_input.ext + #set $cov_names = $use_cov.cov_input.dataset.metadata.individual_names + #set $cov_colms = $use_cov.cov_input.dataset.metadata.individual_columns + #set $cov_dict = dict(zip($cov_names, $cov_colms)) + #set $cov_json = json.dumps($cov_dict, separators=(',',':')) + #set $cov_comp = zlib.compress($cov_json, 9) + #set $cov_arg = base64.b64encode($cov_comp) + #set $cov_min = $use_cov.min_coverage + #set $cov_req = $use_cov.req_thresh + #else + #set $cov_file = '/dev/null' + #set $cov_ext = '' + #set $cov_arg = '' + #set $cov_min = 0 + #set $cov_req = 0 + #end if + diversity_pi.py '$input' '$input.ext' '$snp_arg' '$cov_file' '$cov_ext' '$cov_arg' '$indiv_input' '$cov_min' '$cov_req' '$output' </command> <inputs> - <param name="input" type="data" format="gd_snp" label="SNP dataset" /> - <param name="coverage_input" type="data" format="interval" label="Coverage dataset" /> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP/Genotype dataset" /> + <conditional name="use_cov"> + <param name="choice" type="select" format="integer" label="Include Coverage dataset"> + <option value="1" selected="true">yes</option> + <option value="0">no</option> + </param> + <when value="0" /> + <when value="1"> + <param name="cov_input" type="data" format="gd_snp,gd_genotype" label="Coverage dataset" /> + <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> + <param name="req_thresh" type="integer" min="1" value="1" label="Lower bound for shared well-covered bp" /> + </when> + </conditional> <param name="indiv_input" type="data" format="gd_indivs" label="Population Individuals" /> - <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> </inputs> <outputs> <data name="output" format="txt" metadata_source="input" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <help> +**What it does** + +The user supplies the following: + + 1. A file in gd_genotype or gd_snp format giving the mitochondrial SNPs. + 2. An optional gd_genotype file gives the sequence coverage for each individual at each mitochondrial position. + 3. A set of individuals specified with the "Specify individuals" tool. + 4. The minimum depth of sequence coverage. Positions where an individual has less coverage are ignored. + 5. The number of adequately covered positions that must be shared by two individuals before their diversity is included in the reported average. + +For each pair of individual (with adequate shared coverage), the program divides the number of nucleotide difference between the individuals in those intervals by the intervals' total length. Those ratios are averaged over the relevant pairs of individuals. </help> </tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 dpmix.py --- a/dpmix.py Fri Jul 26 12:51:13 2013 -0400 +++ b/dpmix.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -8,6 +8,20 @@ from dpmix_plot import make_dpmix_plot from LocationFile import LocationFile +def load_and_check_pop(name, file, total_pop): + p = Population(name=name) + p.from_population_file(file) + if not total_pop.is_superset(p): + gd_util.die('There is an individual in {0} that is not in the SNP table'.format(name)) + return p + +def append_pop_tags(the_list, p, input_type, number): + for tag in p.tag_list(): + column, name = tag.split(':') + if input_type == 'gd_genotype': + column = int(column) - 2 + the_list.append('{0}:{1}:{2}'.format(column, number, name)) + ################################################################################ if len(sys.argv) != 22: @@ -16,6 +30,11 @@ input, input_type, data_source, switch_penalty, ap1_input, ap1_name, ap2_input, ap2_name, ap3_input, ap3_name, p_input, output, output2, output2_dir, dbkey, ref_column, galaxy_data_index_dir, heterochromatin_loc_file, ind_arg, het_arg, add_logs = sys.argv[1:] +if ap1_input == '/dev/null': + use_reference = True +else: + use_reference = False + if ap3_input == '/dev/null': populations = 2 else: @@ -39,30 +58,19 @@ p_total = Population() p_total.from_wrapped_dict(ind_arg) -ap1 = Population(name='Ancestral population 1') -ap1.from_population_file(ap1_input) -population_list.append(ap1) -if not p_total.is_superset(ap1): - gd_util.die('There is an individual in ancestral population 1 that is not in the SNP table') +if not use_reference: + ap1 = load_and_check_pop('Ancestral population 1', ap1_input, p_total) + population_list.append(ap1) -ap2 = Population(name='Ancestral population 2') -ap2.from_population_file(ap2_input) +ap2 = load_and_check_pop('Ancestral population 2', ap2_input, p_total) population_list.append(ap2) -if not p_total.is_superset(ap2): - gd_util.die('There is an individual in ancestral population 2 that is not in the SNP table') if populations == 3: - ap3 = Population(name='Ancestral population 3') - ap3.from_population_file(ap3_input) + ap3 = load_and_check_pop('Ancestral population 3', ap3_input, p_total) population_list.append(ap3) - if not p_total.is_superset(ap3): - gd_util.die('There is an individual in ancestral population 3 that is not in the SNP table') -p = Population(name='Potentially admixed') -p.from_population_file(p_input) +p = load_and_check_pop('Potentially admixed', p_input, p_total) population_list.append(p) -if not p_total.is_superset(p): - gd_util.die('There is an individual in the population that is not in the SNP table') gd_util.mkdir_p(output2_dir) @@ -84,42 +92,17 @@ args.append(heterochrom_path) args.append(misc_file) -columns = ap1.column_list() -for column in columns: - col = int(column) - name = ap1.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:1:{1}'.format(col, first_token)) +if use_reference: + args.append('0:1:reference') +else: + append_pop_tags(args, ap1, input_type, 1) -columns = ap2.column_list() -for column in columns: - col = int(column) - name = ap2.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:2:{1}'.format(col, first_token)) +append_pop_tags(args, ap2, input_type, 2) if populations == 3: - columns = ap3.column_list() - for column in columns: - col = int(column) - name = ap3.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:3:{1}'.format(col, first_token)) + append_pop_tags(args, ap3, input_type, 3) -columns = p.column_list() -for column in columns: - col = int(column) - name = p.individual_with_column(column).name - first_token = name.split()[0] - if input_type == 'gd_genotype': - col -= 2 - args.append('{0}:0:{1}'.format(col, first_token)) +append_pop_tags(args, p, input_type, 0) with open(output, 'w') as fh: gd_util.run_program(prog, args, stdout=fh) |
b |
diff -r 4188853b940b -r a631c2f6d913 dpmix.xml --- a/dpmix.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/dpmix.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -1,4 +1,4 @@ -<tool id="gd_dpmix" name="Admixture" version="1.1.0"> +<tool id="gd_dpmix" name="Admixture" version="1.2.0"> <description>: Map genomic intervals resembling specified source populations</description> <command interpreter="python"> @@ -31,7 +31,13 @@ #else if $user_het.choice == '2' #set $het_arg = 'use_none' #end if - '$switch_penalty' '$ap1_input' '$ap1_input.name' '$ap2_input' '$ap2_input.name' '$ap3_arg' '$ap3_name_arg' '$p_input' '$output' '$output2' '$output2.files_path' '$input.dataset.metadata.dbkey' '$input.dataset.metadata.ref' '$GALAXY_DATA_INDEX_DIR' 'gd.heterochromatic.loc' '$ind_arg' '$het_arg' '1' + '$switch_penalty' + #if $use_reference.choice == '0' + '$ap1_input' '$ap1_input.name' + #else if $use_reference.choice == '1' + '/dev/null' 'reference' + #end if + '$ap2_input' '$ap2_input.name' '$ap3_arg' '$ap3_name_arg' '$p_input' '$output' '$output2' '$output2.files_path' '$input.dataset.metadata.dbkey' '$input.dataset.metadata.ref' '$GALAXY_DATA_INDEX_DIR' 'gd.heterochromatic.loc' '$ind_arg' '$het_arg' '$add_logs' </command> <inputs> @@ -57,7 +63,17 @@ </when> </conditional> - <param name="ap1_input" type="data" format="gd_indivs" label="Source population 1 individuals" /> + <conditional name="use_reference"> + <param name="choice" type="select" format="integer" label="History item or Reference sequence"> + <option value="0" selected="true">History item</option> + <option value="1">Reference sequence</option> + </param> + <when value="0"> + <param name="ap1_input" type="data" format="gd_indivs" label="Source population 1 individuals" /> + </when> + <when value="1" /> + </conditional> + <param name="ap2_input" type="data" format="gd_indivs" label="Source population 2 individuals" /> <conditional name="third_pop"> @@ -87,12 +103,10 @@ </when> </conditional> - <!-- <param name="add_logs" type="select" format="integer" label="Probabilities"> <option value="1" selected="true">add logs of probabilities</option> <option value="0">add probabilities</option> </param> - --> </inputs> @@ -101,6 +115,11 @@ <data name="output2" format="html" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + <requirement type="package" version="1.2.1">matplotlib</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 dpmix_plot.py --- a/dpmix_plot.py Fri Jul 26 12:51:13 2013 -0400 +++ b/dpmix_plot.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
b"@@ -3,8 +3,10 @@\n import os\n import sys\n import math\n+\n import matplotlib as mpl\n mpl.use('PDF')\n+from matplotlib.backends.backend_pdf import PdfPages\n import matplotlib.pyplot as plt\n from matplotlib.path import Path\n import matplotlib.patches as patches\n@@ -226,18 +228,49 @@\n return vals, labels\n \n ################################################################################\n+################################################################################\n+################################################################################\n+################################################################################\n+\n+def space_for_legend(plot_params):\n+ space = 0.0\n+\n+ legend_states = plot_params['legend_states']\n+ if legend_states:\n+ ind_space = plot_params['ind_space']\n+ ind_height = plot_params['ind_height']\n+ space += len(legend_states) * (ind_space + ind_height) - ind_space\n+\n+ return space\n+\n+################################################################################\n+\n+def space_for_chroms(plot_params, chroms, individuals, data):\n+ space_dict = {}\n+\n+ chrom_height = plot_params['chrom_height']\n+ ind_space = plot_params['ind_space']\n+ ind_height = plot_params['ind_height']\n+\n+ for chrom in chroms:\n+ space_dict[chrom] = chrom_height\n+\n+ individual_count = 0\n+ for individual in individuals:\n+ if individual in data[chrom]:\n+ individual_count += 1\n+\n+ space_dict[chrom] += individual_count * (ind_space + ind_height)\n+\n+ return space_dict\n+\n+################################################################################\n \n def make_dpmix_plot(input_dbkey, input_file, output_file, galaxy_data_index_dir, state2name=None, populations=3):\n fs_chrom_len = build_chrom_len_dict(input_dbkey, galaxy_data_index_dir)\n chroms, individuals, data, chrom_len, used_states = parse_input_file(input_file)\n \n- if populations == 3:\n- make_state_rectangle = make_state_rectangle_3pop\n- elif populations == 2:\n- make_state_rectangle = make_state_rectangle_2pop\n- else:\n- pass\n-\n+ ## populate chrom_len\n for chrom in chrom_len.keys():\n if chrom in fs_chrom_len:\n chrom_len[chrom] = fs_chrom_len[chrom]\n@@ -245,135 +278,177 @@\n #check_chroms(chroms, chrom_len, input_dbkey)\n check_data(data, chrom_len, input_dbkey)\n \n- ## units below are inches\n- top_space = 0.10\n- chrom_space = 0.25\n- chrom_height = 0.25\n- ind_space = 0.10\n- ind_height = 0.25\n+ ## plot parameters\n+ plot_params = {\n+ 'plot_dpi': 300,\n+ 'page_width': 8.50,\n+ 'page_height': 11.00,\n+ 'top_margin': 0.10,\n+ 'bottom_margin': 0.10,\n+ 'chrom_space': 0.25,\n+ 'chrom_height': 0.25,\n+ 'ind_space': 0.10,\n+ 'ind_height': 0.25,\n+ 'legend_space': 0.10\n+ }\n \n- total_height = 0.0\n-\n- ## make a legend\n- ## only print out states that are\n+ ## in the legend, only print out states that are\n ## 1) in the data\n ## - AND -\n ## 2) in the state2name map\n- ## here, we only calculate the space needed\n legend_states = []\n if state2name is not None:\n for state in used_states:\n if state in state2name:\n legend_states.append(state)\n \n- if legend_states:\n- total_height += len(legend_states) * (ind_space + ind_height)\n- total_height += (top_space - ind_space)\n- at_top = False\n- else:\n- at_top = True\n+ plot_params['legend_states'] = legend_states\n+\n+ ## choose the correct make_state_rectangle method\n+ if populations == 3:\n+ plot_params['rectangle_method'] = make_state_rectangle_3pop\n+ elif populations == 2:\n+ plot_params['rectangle_method'] = make_state_rectangle_2pop\n+\n+ pdf_pages = PdfPages(output_file)\n+\n+\t## generate a list of chroms for each page\n+\n+ needed_for"..b" bottom -= chrom_height/page_height\n+ top = False\n+ else:\n+ bottom -= (chrom_space + chrom_height)/page_height\n \n- if not for_webb:\n- ax = fig.add_axes([0.0, bottom, 1.0, chrom_height/height])\n+ ax = fig.add_axes([0.0, bottom, 1.0, chrom_height/page_height])\n plt.axis('off')\n plt.text(0.5, 0.5, chrom, fontsize=14, ha='center')\n \n- individual_count = 0\n- for individual in individuals:\n- if individual in data[chrom]:\n- individual_count += 1\n+ individual_count = 0\n+ for individual in individuals:\n+ if individual in data[chrom]:\n+ individual_count += 1\n \n- i = 0\n- for individual in individuals:\n- if individual in data[chrom]:\n- i += 1\n+ i = 0\n+ for individual in individuals:\n+ if individual in data[chrom]:\n+ i += 1\n+ bottom -= (ind_space + ind_height)/page_height\n \n- bottom -= (ind_space + ind_height)/height\n- if not for_webb:\n- # [left, bottom, width, height]\n- ax1 = fig.add_axes([0.0, bottom, 0.09, ind_height/height])\n+ ax1 = fig.add_axes([0.0, bottom, 0.09, ind_height/page_height])\n plt.axis('off')\n plt.text(1.0, 0.5, individual, fontsize=10, ha='right', va='center')\n- # [left, bottom, width, height]\n- ax2 = fig.add_axes([0.10, bottom, 0.88, ind_height/height], frame_on=False)\n- ax2.set_xlim(0, length)\n- ax2.set_ylim(0, 1)\n- if i != individual_count:\n- plt.axis('off')\n- else:\n- if not for_webb:\n+\n+ ax2 = fig.add_axes([0.10, bottom, 0.88, ind_height/page_height], frame_on=False)\n+ ax2.set_xlim(0, length)\n+ ax2.set_ylim(0, 1)\n+\n+ if i != individual_count:\n+ plt.axis('off')\n+ else:\n ax2.tick_params(top=False, left=False, right=False, labelleft=False)\n ax2.set_xticks(vals)\n ax2.set_xticklabels(labels)\n- else:\n- plt.axis('off')\n- for p1, p2, state in sorted(data[chrom][individual]):\n- #for patch in make_state_rectangle(p1, p2, state, chrom, individual):\n- for patch in make_state_rectangle(p1, p2, state, chrom, individual):\n- ax2.add_patch(patch)\n+\n+ for p1, p2, state in sorted(data[chrom][individual]):\n+ for patch in make_state_rectangle(p1, p2, state, chrom, individual):\n+ ax2.add_patch(patch)\n \n- plt.savefig(output_file)\n+ # extend last state to end of chrom\n+ if p2 < length:\n+ for patch in make_state_rectangle(p2, length, state, chrom, individual):\n+ ax2.add_patch(patch)\n+\n+\n+ pdf_pages.savefig(fig)\n+ plt.close(fig)\n+\n+ pdf_pages.close()\n \n ################################################################################\n \n if __name__ == '__main__':\n- input_dbkey, input_file, output_file, galaxy_data_index_dir = sys.argv[1:5]\n- make_dpmix_plot(input_dbkey, input_file, output_file, galaxy_data_index_dir)\n+ make_dpmix_plot('loxAfr3', 'output.dat', 'output2_files/picture.pdf', '/scratch/galaxy/home/oocyte/galaxy_oocyte/tool-data', state2name={0: 'heterochromatin', 1: 'reference', 2: 'asian'}, populations=2)\n+# input_dbkey, input_file, output_file, galaxy_data_index_dir = sys.argv[1:5]\n+# make_dpmix_plot(input_dbkey, input_file, output_file, galaxy_data_index_dir)\n sys.exit(0)\n \n ## notes\n" |
b |
diff -r 4188853b940b -r a631c2f6d913 draw_variants.py --- a/draw_variants.py Fri Jul 26 12:51:13 2013 -0400 +++ b/draw_variants.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -6,35 +6,81 @@ ################################################################################ -if len(sys.argv) != 10: - gd_util.die('Usage') - -snp_input, indel_input, coverage_input, annotation_input, indiv_input, ref_name, min_coverage, output, ind_arg = sys.argv[1:] +def load_pop(file, wrapped_dict): + if file == '/dev/null': + pop = None + else: + pop = Population() + pop.from_wrapped_dict(wrapped_dict) + return pop -p_total = Population() -p_total.from_wrapped_dict(ind_arg) - -p1 = Population() -p1.from_population_file(indiv_input) -if not p_total.is_superset(p1): - gd_util.die('There is an individual in the population individuals that is not in the SNP table') +def append_tags(the_list, p, p_type, val): + if p is None: + return + for tag in p.tag_list(): + column, name = tag.split(':') + if p_type == 'gd_genotype': + column = int(column) - 2 + the_list.append('{0}:{1}:{2}'.format(val, column, name)) ################################################################################ -prog = 'mk_Ji' +if len(sys.argv) != 11: + gd_util.die('Usage') + + +snp_file, snp_ext, snp_arg, indiv_input, annotation_input, cov_file, cov_ext, cov_arg, min_coverage, output = sys.argv[1:] + +p_snp = load_pop(snp_file, snp_arg) +p_cov = load_pop(cov_file, cov_arg) + +if indiv_input == '/dev/null': + if p_snp is not None: + p_ind = p_snp + elif p_cov is not None: + p_ind = p_cov + else: + p_ind = None + order_p_ind = True +else: + p_ind = Population() + p_ind.from_population_file(indiv_input) + order_p_ind = False + +## p ind must be from either p_snp or p_cov +if p_snp is not None and p_cov is not None: + if not (p_snp.is_superset(p_ind) or p_cov.is_superset(p_ind)): + gd_util.die('There is an individual in the population individuals that is not in the SNP/Genotype or Coverage table') +elif p_snp is not None: + if not p_snp.is_superset(p_ind): + gd_util.die('There is an individual in the population individuals that is not in the SNP/Genotype table') +elif p_cov is not None: + if not p_cov.is_superset(p_ind): + gd_util.die('There is an individual in the population individuals that is not in the Coverage table') + + +################################################################################ + +prog = 'mito_draw' args = [ prog ] -args.append(snp_input) -args.append(indel_input) -args.append(coverage_input) +args.append(snp_file) +args.append(cov_file) args.append(annotation_input) args.append(min_coverage) -args.append(ref_name) -for tag in p1.tag_list(): - args.append(tag) +if order_p_ind: + for column in sorted(p_ind.column_list()): + individual = p_ind.individual_with_column(column) + name = individual.name.split()[0] + args.append('{0}:{1}:{2}'.format(0, column, name)) +else: + append_tags(args, p_ind, 'gd_indivs', 0) -with open('mk_Ji.out', 'w') as fh: +append_tags(args, p_snp, snp_ext, 1) +append_tags(args, p_cov, cov_ext, 2) + +with open('Ji.spec', 'w') as fh: gd_util.run_program(prog, args, stdout=fh) ################################################################################ @@ -48,9 +94,20 @@ args.append(0.3) args.append('-g') args.append(0.2) -args.append('mk_Ji.out') +args.append('Ji.spec') -with open(output, 'w') as fh: +with open('Ji.svg', 'w') as fh: gd_util.run_program(prog, args, stdout=fh) +################################################################################ + +prog = 'convert' + +args = [ prog ] +args.append('-density') +args.append(100) +args.append('Ji.svg') +args.append('tiff:{0}'.format(output)) + +gd_util.run_program(prog, args) sys.exit(0) |
b |
diff -r 4188853b940b -r a631c2f6d913 draw_variants.xml --- a/draw_variants.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/draw_variants.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -1,41 +1,102 @@ -<tool id="gd_draw_variants" name="Draw" version="1.0.0"> - <description>variants</description> +<tool id="gd_draw_variants" name="Draw variants" version="1.1.0"> + <description>: show positions of SNVs and unsequenced intervals</description> <command interpreter="python"> #import json #import base64 #import zlib - #set $ind_names = $input.dataset.metadata.individual_names - #set $ind_colms = $input.dataset.metadata.individual_columns - #set $ind_dict = dict(zip($ind_names, $ind_colms)) - #set $ind_json = json.dumps($ind_dict, separators=(',',':')) - #set $ind_comp = zlib.compress($ind_json, 9) - #set $ind_arg = base64.b64encode($ind_comp) - draw_variants.py '$input' '$indel_input' '$coverage_input' '$annotation_input' '$indiv_input' '$ref_name' '$min_coverage' '$output' '$ind_arg' + #if $use_snp.choice == '1' + #set $snp_file = $use_snp.snp_input + #set $snp_ext = $use_snp.snp_input.ext + #set $snp_names = $use_snp.snp_input.dataset.metadata.individual_names + #set $snp_colms = $use_snp.snp_input.dataset.metadata.individual_columns + #set $snp_dict = dict(zip($snp_names, $snp_colms)) + #set $snp_json = json.dumps($snp_dict, separators=(',',':')) + #set $snp_comp = zlib.compress($snp_json, 9) + #set $snp_arg = base64.b64encode($snp_comp) + #else + #set $snp_file = '/dev/null' + #set $snp_ext = '' + #set $snp_arg = '' + #end if + #if $use_cov.choice == '1' + #set $cov_file = $use_cov.cov_input + #set $cov_ext = $use_cov.cov_input.ext + #set $cov_names = $use_cov.cov_input.dataset.metadata.individual_names + #set $cov_colms = $use_cov.cov_input.dataset.metadata.individual_columns + #set $cov_dict = dict(zip($cov_names, $cov_colms)) + #set $cov_json = json.dumps($cov_dict, separators=(',',':')) + #set $cov_comp = zlib.compress($cov_json, 9) + #set $cov_arg = base64.b64encode($cov_comp) + #set $cov_min = $use_cov.min_coverage + #else + #set $cov_file = '/dev/null' + #set $cov_ext = '' + #set $cov_arg = '' + #set $cov_min = 0 + #end if + #if $use_indiv.choice == '1' + #set $ind_arg = $use_indiv.indiv_input + #else + #set $ind_arg = '/dev/null' + #end if + draw_variants.py '$snp_file' '$snp_ext' '$snp_arg' '$ind_arg' '$annotation_input' '$cov_file' '$cov_ext' '$cov_arg' '$cov_min' '$output' </command> <inputs> - <param name="input" type="data" format="gd_snp" label="SNP dataset" /> - <param name="indel_input" type="data" format="gd_snp" label="Indel dataset" /> - <param name="coverage_input" type="data" format="interval" label="Coverage dataset" /> + <conditional name="use_snp"> + <param name="choice" type="select" format="integer" label="Include SNP/Genotype dataset"> + <option value="1" selected="true">yes</option> + <option value="0">no</option> + </param> + <when value="0" /> + <when value="1"> + <param name="snp_input" type="data" format="gd_snp,gd_genotype" label="SNP/Genotype dataset" /> + </when> + </conditional> + <conditional name="use_cov"> + <param name="choice" type="select" format="integer" label="Include Coverage dataset"> + <option value="1" selected="true">yes</option> + <option value="0">no</option> + </param> + <when value="0" /> + <when value="1"> + <param name="cov_input" type="data" format="gd_snp,gd_genotype" label="Coverage dataset" /> + <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> + </when> + </conditional> + <conditional name="use_indiv"> + <param name="choice" type="select" label="Compute for"> + <option value="0" selected="true">All individuals</option> + <option value="1">Individuals in a population</option> + </param> + <when value="0" /> + <when value="1"> + <param name="indiv_input" type="data" format="gd_indivs" label="Population Individuals" /> + </when> + </conditional> <param name="annotation_input" type="data" format="interval" label="Annotation dataset" /> - <param name="indiv_input" type="data" format="gd_indivs" label="Population Individuals" /> - - <param name="ref_name" type="select" label="Ref name"> - <options from_dataset="indiv_input"> - <column name="name" index="1"/> - <column name="value" index="1"/> - <filter type="add_value" name="default" value="default" index="0" /> - </options> - </param> - - <param name="min_coverage" type="integer" min="1" value="1" label="Minimum coverage" /> </inputs> <outputs> - <data name="output" format="svg" /> + <data name="output" format="tiff" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <help> +**What it does** + +The user supplies the following: + + 1. A optional file in gd_genotype or gd_snp format giving the mitochondrial SNPs. + 2. An optional gd_genotype file gives the sequence coverage for each individual at each mitochondrial position. + 3. The minimum depth of sequence coverage. Positions where an individual has less coverage are ignoried. + 4. A set of individuals specified with the "Specify individuals" tool. + 5. A file of annotation for the reference mitochondrial sequence. + +The program draws a picture indicating the locations of SNPs and the inadequately covered interval. </help> </tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 filter_gd_snp.xml --- a/filter_gd_snp.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/filter_gd_snp.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -67,6 +67,10 @@ <data name="output" format="input" format_source="input" metadata_source="input" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 find_intervals.xml --- a/find_intervals.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/find_intervals.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -69,6 +69,10 @@ </data> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 gd_snp2vcf.pl --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/gd_snp2vcf.pl Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,222 @@ +#!/usr/bin/perl -w +use strict; + +#convert from gd_snp file to vcf file (with dbSNP fields) + +#gd_snp table format: +#1. chr +#2. position (0 based) +#3. ref allele +#4. second allele +#5. overall quality +#foreach individual (6-9, 10-13, ...) +#a. count of allele in 3 +#b. count of allele in 4 +#c. genotype call (-1, or count of ref allele) +#d. quality of genotype call (quality of non-ref allele from masterVar) + +if (!@ARGV) { + print "usage: gd_snp2vcf.pl file.gd_snp[.gz|.bz2] -geno=8[,12:16,20...] -handle=HANDLE -batch=BATCHNAME -ref=REFERENCEID [-bioproj=XYZ -biosamp=ABC -population=POPID[,POPID2...] -chrCol=9 -posCol=9 ] > snpsForSubmission.vcf\n"; + exit; +} + +my $in = shift @ARGV; +my $genoCols = ''; +my $handle; +my $batch; +my $bioproj; +my $biosamp; +my $ref; +my $pop; +my $cr = 0; #allow to use alternate reference? +my $cp = 1; +my $meta; +my $offset = 0; #offset for genotype column, gd_snp vs gd_genotype indivs file +foreach (@ARGV) { + if (/-geno=([0-9,]+)/) { $genoCols .= "$1:"; } + elsif (/-geno=(.*)/) { $genoCols .= readGeno($1); } + elsif (/-off=([0-9])/) { $offset = $1; } + elsif (/-handle=(.*)/) { $handle = $1; } + elsif (/-batch=(.*)/) { $batch = $1; } + elsif (/-bioproj=(.*)/) { $bioproj = $1; } + elsif (/-biosamp=(.*)/) { $biosamp = $1; } + elsif (/-ref=(.*)/) { $ref = $1; } + elsif (/-population=(\S+)/) { $pop = $1; } + elsif (/-chrCol=(\d+)/) { $cr = $1 - 1; } + elsif (/-posCol=(\d+)/) { $cp = $1 - 1; } + elsif (/-metaOut=(.*)/) { $meta = $1; } +} +if ($cr < 0 or $cp < 0) { die "ERROR the column numbers should be 1 based.\n"; } + +#remove trailing delimiters +$genoCols =~ s/,:/:/g; +$genoCols =~ s/[,:]$//; + +my @gnc = split(/,|:/, $genoCols); + +if ($in =~ /.gz$/) { + open(FH, "zcat $in |") or die "Couldn't open $in, $!\n"; +}elsif ($in =~ /.bz2$/) { + open(FH, "bzcat $in |") or die "Couldn't open $in, $!\n"; +}else { + open(FH, $in) or die "Couldn't open $in, $!\n"; +} +my @head = prepHeader(); +if (@head) { + print join("\n", @head), "\n"; + #now column headers + print "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"; + if (defined $pop) { + $pop =~ s/,$//; + my $t = $pop; + $t =~ s/,/\t/g; + print "\tFORMAT\t$t"; + } + print "\n"; +} +while (<FH>) { + chomp; + if (/^#/) { next; } + if (/^\s*$/) { next; } + my @f = split(/\t/); + #vcf columns: chrom pos id ref alt qual filter info + # info must have VRT=[0-9] 1==SNV 2=indel 6=NoVariation 8=MNV ... + my $vrt = 1; + if ($f[2] !~ /^[ACTG]$/ or $f[3] !~ /^[ACTG]$/) { + die "Sorry this can only do SNV's at this time\n"; + } + if (scalar @gnc == 1 && !defined $pop) { #single genotype column + if (!defined $f[4] or $f[4] == -1) { $f[4] = '.'; } + if ($f[$gnc[0]-1] == 2) { $vrt = 6; } #reference match + if ($f[$gnc[0]-1] == -1) { next; } #no data, don't use + print "$f[$cr]\t$f[$cp]\t$f[$cr];$f[$cp]\t$f[2]\t$f[3]\t$f[4]\t.\tVRT=$vrt\n"; + #TODO? put read counts in comment? + }elsif ($pop) { #do as population + my @cols; + foreach my $gp (split(/:/,$genoCols)) { #foreach population + my @g = split(/,/, $gp); + my $totChrom = 2*(scalar @g); + my $totRef = 0; + foreach my $i (@g) { if (!defined $f[$i-1] or $f[$i-1] == -1) { $totChrom -= 2; next; } $totRef += $f[$i-1]; } + if ($totChrom == $totRef) { $vrt = 6; } + if ($totRef > $totChrom) { die "ERROR likely the wrong column was chosen for genotype\n"; } + my $altCnt = $totChrom - $totRef; + push(@cols, "$totChrom:$altCnt"); + } + print "$f[$cr]\t$f[$cp]\t$f[$cr];$f[$cp]\t$f[2]\t$f[3]\t$f[4]\t.\tVRT=$vrt\tNA:AC\t", join("\t", @cols), "\n"; + }else { #leave allele counts off + my $totChrom = 2*(scalar @gnc); + my $totRef = 0; + foreach my $i (@gnc) { if ($f[$i-1] == -1) { $totChrom -= 2; next; } $totRef += $f[$i-1]; } + if ($totChrom == $totRef) { $vrt = 6; } + print "$f[$cr]\t$f[$cp]\t$f[$cr];$f[$cp]\t$f[2]\t$f[3]\t$f[4]\t.\tVRT=$vrt\n"; + } +} +close FH or die "Couldn't close $in, $!\n"; + +if ($meta) { + open(FH, ">", $meta) or die "Couldn't open $meta, $!\n"; + print FH "TYPE: CONT\n", + "HANDLE: $handle\n", + "NAME: \n", + "FAX: \n", + "TEL: \n", + "EMAIL: \n", + "LAB: \n", + "INST: \n", + "ADDR: \n", + "||\n", + "TYPE: METHOD\n", + "HANDLE: $handle\n", + "ID: \n", + "METHOD_CLASS: Sequence\n", + "TEMPLATE_TYPE: \n", + "METHOD:\n", + "||\n"; + if ($pop) { + my @p = split(/,/, $pop); + foreach my $t (@p) { + print FH + "TYPE: POPULATION\n", + "HANDLE: $handle\n", + "ID: $t\n", + "POPULATION: \n", + "||\n"; + } + } + print FH "TYPE: SNPASSAY\n", + "HANDLE: $handle\n", + "BATCH: $batch\n", + "MOLTYPE: \n", + "METHOD: \n", + "ORGANISM: \n", + "||\n", + "TYPE: SNPPOPUSE | SNPINDUSE\n", + "HANDLE: $handle\n", + "BATCH: \n", + "METHOD: \n", + "||\n"; + + close FH or die "Couldn't close $meta, $!\n"; +} + +exit 0; + +#parse old header and add or create new +sub prepHeader { + my @h; + $h[0] = '##fileformat=VCFv4.1'; + my ($day, $mo, $yr) = (localtime)[3,4,5]; + $mo++; + $yr+=1900; + $h[1] = '##fileDate=' . "$yr$mo$day"; + $h[2] = "##handle=$handle"; + $h[3] = "##batch=$batch"; + my $i = 4; + if ($bioproj) { $h[$i] = "##bioproject_id=$bioproj"; $i++; } + if ($biosamp) { $h[$i] = "##biosample_id=$biosamp"; $i++; } + $h[$i] = "##reference=$ref"; ##reference=GCF_999999.99 + #$i++; + #$h[$i] = '##INFO=<ID=LID, Number=1,Type=string, Description="Unique local variation ID or name for display. The LID provided here combined with the handle must be unique for a particular submitter.">' + $i++; + $h[$i] = '##INFO=<ID=VRT,Number=1,Type=Integer,Description="Variation type,1 - SNV: single nucleotide variation,2 - DIV: deletion/insertion variation,3 - HETEROZYGOUS: variable, but undefined at nucleotide level,4 - STR: short tandem repeat (microsatellite) variation, 5 - NAMED: insertion/deletion variation of named repetitive element,6 - NO VARIATON: sequence scanned for variation, but none observed,7 - MIXED: cluster contains submissions from 2 or more allelic classes (not used) ,8 - MNV: multiple nucleotide variation with all eles of common length greater than 1,9 - Exception">'; + #sometimes have allele freqs? + if (defined $pop) { + $i++; + $h[$i] = "##FORMAT=<ID=NA,Number=1,Type=Integer,Description=\"Number of alleles for the population.\""; + $i++; + $h[$i] = '##FORMAT=<ID=AC,Number=.,Type=Integer,Description="Allele count for each alternate allele.">'; + my @p = split(/,/, $pop); + foreach my $t (@p) { + $i++; + $h[$i] = "##population_id=$t"; + } + } + #PMID? +##INFO=<ID=PMID,Number=.,Type=Integer,Description="PubMed ID linked to variation if available."> + + return @h; +} +####End + +#read genotype columns from a file +sub readGeno { + my $list = shift @_; + my @files = split(/,/, $list); + my $cols=''; + foreach my $file (@files) { + open(FH, $file) or die "Couldn't read $file, $!\n"; + while (<FH>) { + chomp; + my @f = split(/\s+/); + if ($f[0] =~/\D/) { die "ERROR expect an integer for the column\n"; } + $f[0] += $offset; + $cols .= "$f[0],"; + } + close FH; + $cols .= ":"; + } + $cols =~ s/,:$//; + return $cols; +} +####End |
b |
diff -r 4188853b940b -r a631c2f6d913 gd_snp2vcf.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/gd_snp2vcf.xml Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,155 @@ +<tool id="gd_gd_snp2vcf" name="gd_snp to VCF" version="1.1.0" force_history_refresh="True"> + <description>: Convert from gd_snp or gd_genotype to VCF format, for submission to dbSNP</description> + + <command interpreter="perl"> + gd_snp2vcf.pl "$input" -handle=$hand -batch=$batch -ref=$ref -metaOut=$output2 + #if $individuals.choice == '0': + #set $geno = '' + #for $individual_col in $input.dataset.metadata.individual_columns + ##need to check to number of cols per individual + #if $input.ext == "gd_snp": + #set $t = $individual_col + 2 + #else if $input.ext == "gd_genotype": + #set $t = $individual_col + #else: + #set $t = $individual_col + #end if + #set $geno += "%d," % ($t) + #end for + #if $individuals.pall_id != '': + -population=$individuals.pall_id + #end if + #else if $individuals.choice == '1': + #set $geno = '' + #set $pop = '' + #if $input.ext == "gd_snp": + -off=2 + #else if $input.ext == "gd_genotype": + -off=0 + #else: + -off=2 + #end if + #for $population in $individuals.populations + #set $geno += "%s," % ($population.p1_input) + #set $pop += "%s," % ($population.p1_id) + #end for + -population=$pop + #else if $individuals.choice == '2': + #set $geno = $individuals.geno + #end if + -geno=$geno + #if $bioproj.value != '': + -bioproj=$bioproj + #end if + #if $biosamp.value != '': + -biosamp=$biosamp + #end if + > $output + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP dataset" /> + <conditional name="individuals"> + <param name="choice" type="select" label="Generate dataset for"> + <option value="0" selected="true">All individuals</option> + <option value="1">Individuals in populations</option> + <option value="2">A single individual</option> + </param> + <when value="0"> + <param name="pall_id" type="text" size="20" label="ID for this population" help="Leaving this blank will omit allele counts from the output" /> + </when> + <when value="1"> + <repeat name="populations" title="Population" min="1"> + <param name="p1_input" type="data" format="gd_indivs" label="Population individuals" /> + <param name="p1_id" type="text" size="20" label="ID for this population" help="Leaving this blank will omit allele counts from the output" /> + </repeat> + </when> + <when value="2"> + <param name="geno" type="data_column" data_ref="input" label="Column containing genotype" value="8" /> + </when> + </conditional> + <param name="hand" type="text" size="20" label="dbSNP handle" help="If you do not have a handle, request one at http://www.ncbi.nlm.nih.gov/projects/SNP/handle.html" /> + <param name="batch" type="text" size="20" label="Batch ID" help="ID used to tie dbSNP metadata to the VCF submission" /> + <param name="ref" type="text" size="20" label="Reference sequence ID" help="The RefSeq assembly accession.version on which the SNP positions are based (see http://www.ncbi.nlm.nih.gov/assembly/)" /> + <param name="bioproj" type="text" size="20" label="Optional: Registered BioProject ID" /> + <param name="biosamp" type="text" size="20" label="Optional: Comma-separated list of registered BioSample IDs" /> + </inputs> + + <outputs> + <data name="output" format="vcf" /> + <data name="output2" format="text" /> + </outputs> + + <tests> + <test> + <param name="input" value="sample.gd_snp" ftype="gd_snp" /> + <param name="choice" value="2" /> + <param name="geno" value="11" /> + <param name="hand" value="MyHandle" /> + <param name="batch" value="Test1" /> + <param name="ref" value="pb_000001.1" /> + <output name="output" file="snpsForSubmission.vcf" ftype="vcf" compare="diff" /> + <output name="output2" file="snpsForSubmission.text" ftype="text" compare="diff" /> + </test> + </tests> + + <help> + +**Dataset formats** + +The input dataset is in gd_snp_ or gd_genotype_ format. +The output consists of two datasets needed for submitting SNPs: +a VCF_ file in the specific format required by dbSNP, and a partially +completed text_ file for the associated dbSNP metadata. +(`Dataset missing?`_) + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _VCF: ./static/formatHelp.html#vcf +.. _text: ./static/formatHelp.html#text +.. _Dataset missing?: ./static/formatHelp.html + +----- + +**What it does** + +This tool converts a dataset in gd_snp or gd_genotype format to a VCF file formatted +for submission to the dbSNP database at NCBI. It also creates a partially +filled-in template to assist you in preparing the required "metadata" file +describing the SNP submission. + +----- + +**Example** + +- input:: + + #{"column_names":["scaf","pos","A","B","qual","ref","rpos","rnuc","1A","1B","1G","1Q","2A","2B","2G","2Q","3A","3B","3G","3Q","4A","4B","4G","4Q","5A","5B","5G","5Q","6A","6B","6G","6Q","pair","dist", + #"prim","rflp"],"dbkey":"canFam2","individuals":[["PB1",9],["PB2",13],["PB3",17],["PB4",21],["PB6",25],["PB8",29]],"pos":2,"rPos":7,"ref":6,"scaffold":1,"species":"bear"} + Contig161 115 C T 73.5 chr1 4641382 C 6 0 2 45 8 0 2 51 15 0 2 72 5 0 2 42 6 0 2 45 10 0 2 57 Y 54 0.323 0 + Contig48 11 A G 94.3 chr1 10150264 A 1 0 2 30 1 0 2 30 1 0 2 30 3 0 2 36 1 0 2 30 1 0 2 30 Y 22 +99. 0 + Contig20 66 C T 54.0 chr1 21313534 C 4 0 2 39 4 0 2 39 5 0 2 42 4 0 2 39 4 0 2 39 5 0 2 42 N 1 +99. 0 + etc. + +- VCF output (for all individuals, and giving a population ID):: + + #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PB + Contig161 115 Contig161;115 C T 73.5 . VRT=6 NA:AC 8:0 + Contig48 11 Contig48;11 A G 94.3 . VRT=6 NA:AC 8:0 + Contig 66 Contig20;66 C T 54.0 . VRT=6 NA:AC 8:0 + etc. + +Note: This excerpt from the output does not show all of the headers. Also, +if the population ID had not been given, then the last two columns would not +appear in the output. + +----- + +**Reference** + +Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. +dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001 +Jan 1;29(1):308-11. + + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/Makefile --- a/genome_diversity/Makefile Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,8 +0,0 @@ -all: - cd src && make - -clean: - cd src && make clean - -install: - cd src && make install |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/bin/gd_ploteig --- a/genome_diversity/bin/gd_ploteig Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,172 +0,0 @@ -#!/usr/bin/env perl - -### ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k] [-y] [-z sep] -use Getopt::Std ; -use File::Basename ; -use warnings ; - -## pops : separated -x = make postscript and pdf -z use another separator -## -k keep intermediate files -## NEW if pops is a file names are read one per line - -getopts('i:o:p:c:s:d:z:t:xky',\%opts) ; -$postscmode = $opts{"x"} ; -$oldkeystyle = $opts{"y"} ; -$kflag = $opts{"k"} ; -$keepflag = 1 if ($kflag) ; -$keepflag = 1 unless ($postscmode) ; - -$zsep = ":" ; -if (defined $opts{"z"}) { - $zsep = $opts{"z"} ; - $zsep = "\+" if ($zsep eq "+") ; -} - -$title = "" ; -if (defined $opts{"t"}) { - $title = $opts{"t"} ; -} -if (defined $opts{"i"}) { - $infile = $opts{"i"} ; -} -else { - usage() ; - exit 0 ; -} -open (FF, $infile) || die "can't open $infile\n" ; -@L = (<FF>) ; -chomp @L ; -$nf = 0 ; -foreach $line (@L) { - next if ($line =~ /^\s+#/) ; - @Z = split " ", $line ; - $x = @Z ; - $nf = $x if ($nf < $x) ; -} -printf "## number of fields: %d\n", $nf ; -$popcol = $nf-1 ; - - -if (defined $opts{"p"}) { - $pops = $opts{"p"} ; -} -else { - die "p parameter compulsory\n" ; -} - -$popsname = setpops ($pops) ; -print "$popsname\n" ; - -$c1 = 1; $c2 =2 ; -if (defined $opts{"c"}) { - $cols = $opts{"c"} ; - ($c1, $c2) = split ":", $cols ; - die "bad c param: $cols\n" unless (defined $cols) ; -} - -$stem = "$infile.$c1:$c2" ; -if (defined $opts{"s"}) { - $stem = $opts{"s"} ; -} -$gnfile = "$stem.$popsname.xtxt" ; - -if (defined $opts{"o"}) { - $gnfile = $opts{"o"} ; -} - -@T = () ; ## trash -open (GG, ">$gnfile") || die "can't open $gnfile\n" ; -print GG "## " unless ($postscmode) ; -print GG "set terminal postscript color\n" ; -print GG "set style line 2 lc rgbcolor \"#376600\"\n"; -print GG "set style line 11 lc rgbcolor \"#376600\"\n"; -print GG "set style line 20 lc rgbcolor \"#376600\"\n"; -print GG "set style line 29 lc rgbcolor \"#376600\"\n"; -print GG "set style line 6 lc rgbcolor \"#FFCC00\"\n"; -print GG "set style line 15 lc rgbcolor \"#FFCC00\"\n"; -print GG "set style line 24 lc rgbcolor \"#FFCC00\"\n"; -print GG "set style increment user\n"; -print GG "set title \"$title\" \n" ; -print GG "set key outside\n" unless ($oldkeystyle) ; -print GG "set xlabel \"eigenvector $c1\" \n" ; -print GG "set ylabel \"eigenvector $c2\" \n" ; -print GG "plot " ; -$np = @P ; -$lastpop = $P[$np-1] ; -$d1 = $c1+1 ; -$d2 = $c2+1 ; -foreach $pop (@P) { - $dfile = "$stem:$pop" ; - push @T, $dfile ; - print GG " \"$dfile\" using $d1:$d2 title \"$pop\" " ; - print GG ", \\\n" unless ($pop eq $lastpop) ; - open (YY, ">$dfile") || die "can't open $dfile\n" ; - foreach $line (@L) { - next if ($line =~ /^\s+#/) ; - @Z = split " ", $line ; - next unless (defined $Z[$popcol]) ; - next unless ($Z[$popcol] eq $pop) ; - print YY "$line\n" ; - } - close YY ; -} -print GG "\n" ; -print GG "## " if ($postscmode) ; -print GG "pause 9999\n" ; -close GG ; - -if ($postscmode) { -$psfile = "$stem.ps" ; - - if ($gnfile =~ /xtxt/) { - $psfile = $gnfile ; - $psfile =~ s/xtxt/ps/ ; - } -system "gnuplot < $gnfile > $psfile" ; -#system "fixgreen $psfile" ; -system "ps2pdf $psfile " ; -} -unlink (@T) unless $keepflag ; - -sub usage { - -print "ploteig -i eigfile -p pops -c a:b [-t title] [-s stem] [-o outfile] [-x] [-k]\n" ; -print "-i eigfile input file first col indiv-id last col population\n" ; -print "## as output by smartpca in outputvecs \n" ; -print "-c a:b a, b columns to plot. 1:2 would be common and leading 2 eigenvectors\n" ; -print "-p pops Populations to plot. : delimited. eg -p Bantu:San:French\n" ; -print "## pops can also be a filename. List populations 1 per line\n" ; -print "[-s stem] stem will start various output files\n" ; -print "[-o ofile] ofile will be gnuplot control file. Should have xtxt suffix\n"; -print "[-x] make ps and pdf files\n" ; -print "[-k] keep various intermediate files although -x set\n" ; -print "## necessary if .xtxt file is to be hand edited\n" ; -print "[-y] put key at top right inside box (old mode)\n" ; -print "[-t] title (legend)\n" ; - -print "The xtxt file is a gnuplot file and can be easily hand edited. Intermediate files -needed if you want to make your own plot\n" ; - -} -sub setpops { - my ($pops) = @_ ; - local (@a, $d, $b, $e) ; - - if (-e $pops) { - open (FF1, $pops) || die "can't open $pops\n" ; - @P = () ; - foreach $line (<FF1>) { - ($a) = split " ", $line ; - next unless (defined $a) ; - next if ($a =~ /\#/) ; - push @P, $a ; - } - $out = join ":", @P ; - print "## pops: $out\n" ; - ($b, $d , $e) = fileparse($pops) ; - return $b ; - } - @P = split $zsep, $pops ; - return $pops ; - -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/bin/varplot --- a/genome_diversity/bin/varplot Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
b'@@ -1,464 +0,0 @@\n-#!/usr/bin/env python\n-\n-"""\n-Take a specification file and draw the Miller plot.\n-\n-The specification should have a header where each line starts with a "@". "@"\n-should be followed by a tag and value separated by a "=". Currently the only\n-defined tag is GL which is the genome length of the genome under consideration.\n-\n-The lines after the header should be one for each genome. The first column\n-should be the name of the individual/genome followed by the space separated\n-positions which need to be marked.\n-\n-An example spec file will be as follows:\n-\n-@GN=IR_3\n-@GL=14574\n-@GA=0:64::tRNA\n-@GA=64:1035:nad2:gene\n-@GA=1035:1100::tRNA\n-@GA=1092:1153::tRNA\n-@GA=1160:1226::tRNA\n-@GA=1218:2757:cox1:gene\n-@GA=2764:3440:cox2:gene\n-@GA=3440:3509::tRNA\n-@GA=3508:3574::tRNA\n-@GA=3574:3730:atp8:gene\n-@GA=3723:4389:atp6:gene\n-@GA=4395:5173:cox3:gene\n-@GA=5173:5236::tRNA\n-@GA=5236:5572:nad3:gene\n-@GA=5572:5633::tRNA\n-@GA=5632:5696::tRNA\n-@GA=5695:5763::tRNA\n-@GA=5765:5820::tRNA\n-@GA=5820:5885::tRNA\n-@GA=5883:5948::tRNA\n-@GA=5948:7617:nad5:gene\n-@GA=7617:7678::tRNA\n-@GA=7680:8997:nad4:gene\n-@GA=8990:9266:nad4L:gene\n-@GA=9268:9330::tRNA\n-@GA=9330:9395::tRNA\n-@GA=9397:9826:nad6:gene\n-@GA=9829:10910:cob:gene\n-@GA=10910:10976::tRNA\n-@GA=10993:11912:nad1:gene\n-@GA=11912:11978::tRNA\n-@GA=11992:12053::tRNA\n-@GA=12034:13289:rrnL:gene\n-@GA=12034:13289:16S:rRNA\n-@GA=13289:13351::tRNA\n-@GA=13351:14069:rrnS:gene\n-@GA=13351:14069:12S:rRNA\n-@GA=14423:14492::tRNA\n-@GA=14499:14569::tRNA\n-@CL=rRNA:#2B83BA\n-@CL=tRNA:#FFFFBF\n-@CL=gene:#D7191C\n-@CL=special:#000000\n-@CL=indel:#FDAE61\n-@CL=missing:#ABDDA4\n-IR_65 2618 3267 3752 7768 8523 special=10177 10848 12790 13157 indel=3500:3560\n-missing=4000:6000\n-IR_66 2618 3267 3752 7768 8523 special=10177 10848 12790 13157 missing=4000:6000\n-IR_63 2618 3267 3752 4883 8523 9798 10848 13157 missing=1:1000\n-"""\n-\n-from sys import argv, stderr, exit\n-from getopt import getopt, GetoptError\n-from commands import getstatusoutput\n-\n-__author__ = "Aakrosh Ratan"\n-__email__ = "ratan@bx.psu.edu"\n-\n-# do we want the debug information to be printed?\n-debug_flag = False\n-\n-varstrokewidth = 1\n-\n-# print the header for the image.. Inputs are in cm\n-def printheader(imagewidth, imageheight):\n- print "<?xml version=\\"1.0\\" standalone=\\"no\\"?>"\n- print "<!DOCTYPE svg PUBLIC \\"-//W3C//DTD SVG 1.1//EN\\""\n- print "\\"http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd\\">"\n-\n- print "<svg width=\\"%dcm\\"" % imagewidth\n- print "\\theight=\\"%dcm\\"" % imageheight\n- print "\\tviewBox=\\"0 0 %d %d\\"" % (100*imagewidth, 100*imageheight)\n- print "\\txmlns=\\"http://www.w3.org/2000/svg\\" version=\\"1.1\\">"\n-\n-# print the footer for the svg image.. Inputs are in cm\n-def printfooter():\n- print "</svg>"\n-\n-# print a rectangle\n-def printrectangle(x, y, w, h, \n- sw = 2, \n- so = 1.0,\n- rx = None, \n- cp = None, \n- fl = "none",\n- fo = 1.0):\n-# print >> stderr, "Rectangle: %d %d %d %d" % (x, y, w, h)\n- print "<rect x=\\"%d\\"" % x\n- print "\\ty=\\"%d\\"" % y\n- print "\\twidth=\\"%d\\"" % w \n- print "\\theight=\\"%d\\"" % h\n- if rx != None: print "\\trx=\\"%d\\"" % rx\n- print "\\tstroke=\\"black\\""\n- if cp != None: print "\\tclip-path=\\"url(#g%d)\\"" % cp\n- print "\\tstroke-width=\\"%d\\"" % sw\n- print "\\tstroke-opacity=\\"%2.2f\\"" % so\n- print "\\tfill-opacity=\\"%2.2f\\"" % fo\n- print "\\tfill=\\"%s\\" />" % fl\n-\n-def printtext(x, y, text, \n- fontfamily = "Times", \n- fontweight = "bold",\n- fontsize = "0.9cm",\n- fontvariant = "normal",\n- fill = "black",\n- dx = 0):\n-# print >> stderr, "Text: %d %d %s" % (x, y, text)\n- print "<text x=\\"%d\\"" % x\n- print "\\tdx=\\"%d\\"" % dx\n- print "\\ty=\\"%d\\"" % y\n- print "\\tfill=\\"%s\\"" % fill\n- print "\\tfont-family=\\"%s\\"" % fontfamily\n- print "\\tfont-weight=\\"%s\\"" % fontweight\n'..b'= "Black"\n- \n- printtext(x,\n- y + (eachplotheight * 85),\n- name,\n- fill = color,\n- dx = (((stop-start)*cm2pt/cm2bp)-(wordlen*cm2pt))/2)\n- else:\n- # will this name fit at all, even at the bottom? Where is the\n- # next text label that I need to write?\n- tmpidx = index + 1\n- while tmpidx < len(attributes) and \\\n- len(attributes[tmpidx][2]) == 0:\n- tmpidx += 1\n- \n- if tmpidx < len(attributes):\n- nextstart = int(attributes[tmpidx][0])\n- if ((wordlen*cm2pt) < ((nextstart-start) * cm2pt/cm2bp)):\n- printtext(x, \n- y + (eachplotheight + vgap) * cm2pt, \n- name,\n- colors.get(group, "Black"))\n- \n- y += ((eachplotheight + vgap) * cm2pt)\n- \n- # print the coordinates on a line\n- y += vgap * cm2pt\n- if debug_flag == True:\n- printrectangle(0, y, imagewidth * cm2pt, eachplotheight * cm2pt) \n-\n- printline(figstart, y, figstart + figlen, y)\n-\n- x = figstart\n- ticlength = 15\n- for i in range(0, genomelength, 2000):\n- printline(x, y, x, y + ticlength)\n- printtext(x, y + ticlength + vgap * cm2pt, str(i), fontweight="normal")\n- x += (2000 * cm2pt / cm2bp) \n- printline(figstart + figlen, y, figstart + figlen, y + ticlength)\n-\n- # print the legend if there were attributes\n- if len(attributes) > 0:\n- if debug_flag == True:\n- printrectangle(0, y, imagewidth * cm2pt, eachplotheight * cm2pt) \n- \n- y += ((eachplotheight + 2 * vgap) * cm2pt)\n- x = figstart\n- \n- for name,color in colors.items():\n- printtext(x, y, name, fontsize = "0.9cm")\n- x += ((len(name) + 1) * mf * cm2pt) \n- printrectangle(x, \n- y - eachplotheight * cm2pt + 10, \n- 100, \n- eachplotheight * cm2pt * 3 / 4,\n- fl = color)\n- x += 125\n-\n- # end of the image\n- printfooter()\n-\n-def usage():\n- f = stderr\n- print >> f, "usage:"\n- print >> f, "varplot [options] specfile"\n- print >> f, "where the options are:"\n- print >> f, "-h,--help : print usage and quit"\n- print >> f, "-d,--debug: print debug information"\n- print >> f, "-w,--strokewidth: stroke width for normal variants [1]"\n- print >> f, "-b,--eachplotheight : height of the plot for an individual (in cm) [0.4]"\n- print >> f, "-g,--eachplotgap : vertical gap between plots of different individuals (in cm) [0.4]"\n-\n-if __name__ == "__main__":\n- try:\n- opts, args = getopt(argv[1:],"hdw:s:g:",["help","debug","strokewidth=", "eachplotheight=", "eachplotgap="])\n- except GetoptError, err:\n- print str(err)\n- usage()\n- exit(2) \n-\n- # number of bases to be drawn in 1 cm\n- cm2bp = 1000\n-\n- # the strokewidth used to show simple SNPs\n- varstrokewidth = 1\n-\n- # the height of the plot for each individual (in cm)\n- eachplotheight = 0.4\n-\n- # vertical gap between plots of different individuals (in cm)\n- vgap = 0.4\n-\n- for o, a in opts:\n- if o in ("-h", "--help"):\n- usage()\n- exit()\n- elif o in ("-d", "--debug"):\n- debug_flag = True\n- elif o in ("-w", "--strokewidth"):\n- varstrokewidth = int(a)\n- elif o in ("-s", "--eachplotheight"):\n- eachplotheight = float(a)\n- elif o in ("-g", "--eachplotgap"):\n- vgap = float(a)\n- else:\n- assert False, "unhandled option"\n-\n- if len(args) != 1:\n- usage()\n- exit(3)\n-\n- main(args[0], cm2bp, eachplotheight, vgap)\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Fst_ave.c --- a/genome_diversity/src/Fst_ave.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
b'@@ -1,361 +0,0 @@\n-/* Fst_ave -- determine four FST values between two specified populations,\n-* and optionally between several pairs of random populations\n-*\n-* argv{1] = a Galaxy SNP table. For each of several individuals, the table\n-* has four columns (#A, #B, genotype, quality).\n-* argv[2] = 1 if FST is estimated from SAMtools genotypes; 0 means use\n-*\t read-coverage data.\n-* argv[3] = lower bound, for individual quality value if argv[2] = 1,\n-*\t or for total number of reads per population if argv[2] = 0.\n-*\t SNPs not satisfying these lower bounds are ignored.\n-* argv[4] = 1 to discard SNPs that appear fixed in the two populations\n-* argv[5] = k says report the maximum and average FST over k randomly\n-* chosen splits into two populations of two original sizes\n-* argv[6], argv[7], ..., have the form "13:1", "13:2" or "13:0", meaning\n-* that the 13th and 14th columns (base 1) give the allele counts\n-* (and column 15 gives the genotype) for an individual that is in\n-*\t population 1, in population 2, or in neither population.\n-\n-What it does on Galaxy\n-\n-The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows.\n-\n-Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations.\n-\n-After specifying the data source, the user sets lower bounds on amount of data required at a SNP. For estimating the FST using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. SMPs not meeting these lower bounds are ignored.\n-\n-The user specifies whether SNPs where both populations appear to be fixed for the same allele should be retained or discarded.\n-\n-Finally, the user decides whether to use randomizations. If so, then the user specifies how many randomly generated population pairs (retaining the numbers of individuals of the originals) to generate, as well as the "population" of additional individuals (not in the first two populations) that can be used in the randomization process.\n-\n-The program prints the following measures of FST for the two populations.\n-1. The formulation by Sewall Wright (average over FSTs for all SNPs).\n-2. The Weir-Cockerham estimator (average over FSTs for all SNPs).\n-3. The Reich-Patterson estimator (average over FSTs for all SNPs).\n-4. The population-based Reich-Patterson estimator.\n-\n-If randomizations were requested, it prints a summary for each of the four definitions of FST that includes the maximum and average value, and the highest-scoring population pair (if any scored higher than the two user-specified populations).\n-\n-References:\n-\n-Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354.\n-\n-B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370.\n-\n-Weir, B.S. 1996. Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sunderland, MA.\n-\n-David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. \n-\n-Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the followoing paper.\n-\n-Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649.\n-\n-*/\n-\n-#include "lib.h"\n-#include "Fst_lib.h"\n-\n-// maximum length of a line '..b'\n-\t\tnew->c = ckalloc(nI*sizeof(struct count));\n-\t\t// set X[i] = atoi(i-th word of buf), i is base 1\n-\t\tfor (i = 1, p = strtok(buf, z); p != NULL;\n-\t\t ++i, p = strtok(NULL, z))\n-\t\t\tX[i] = atoi(p);\n-\t\tfor (i = 0; i < nI; ++i) {\n-\t\t\tn = col[i];\n-\t\t\tif (genotypes) {\n-\t\t\t\tk = X[n+2];\n-\t\t\t\tif (k == -1)\n-\t\t\t\t\tnew->c[i].A = new->c[i].B = -1;\n-\t\t\t\telse {\n-\t\t\t\t\tnew->c[i].A = k;\n-\t\t\t\t\tnew->c[i].B = 2 - k;\n-\t\t\t\t}\n-\t\t\t} else {\n-\t\t\t\tnew->c[i].A = X[n];\n-\t\t\t\tnew->c[i].B = X[n+1];\n-\t\t\t}\n-\t\t}\n-\t\tif (start == NULL)\n-\t\t\tstart = new;\n-\t\telse\n-\t\t\tlast->next = new;\n-\t\tlast = new;\n-\t}\n-\tfclose(fp);\n-\n-\tpop_Fst();\n-\tprintf("Using %d SNPs, we compute:\\n", nsnp);\n-\tprintf("Average Reich-Patterson FST is %5.5f.\\n", F2 = F_reich);\n-\tprintf("The population-based Reich-Patterson Fst is %5.5f.\\n",\n-\t F3 = N_reich/D_reich);\n-\tprintf("Average Weir-Cockerham FST is %5.5f.\\n", F1 = F_weir);\n-\tprintf("Average Wright FST is %5.5f.\\n", F0 = F_wright);\n-\tif (nfail > 0)\n-\t\tprintf("WARNING: %d of %d FSTs could not be computed\\n",\n-\t\t nfail, 3*nsnp);\n-\tif (nshuff == 0)\n-\t\treturn 0;\n-\n-\t// do the following only if randomization is requested\n-\tfor (j = 0; j < nI; ++j)\n-\t\tbest_x0[j] = best_x1[j] = best_x2[j] = best_x3[j] = x[j];\n-\ttot_F0 = tot_F1 = tot_F2 = tot_F3 =\n-\t largest_F0 = largest_F1 = largest_F2 = largest_F3 = 0.0;\n-\tlarger0 = larger1 = larger2 = larger3 = 0;\n-\tfor (i = 0; i < nshuff; ++i) {\n-\t\tshuffle();\n-\t\tpop_Fst();\n-\n-\t\t// Wright\n-\t\tif ((F = F_wright) > F0)\n-\t\t\t++larger0;\n-\t\tif (F > largest_F0) {\n-\t\t\tlargest_F0 = F;\n-\t\t\tfor (j = 0; j < nI; ++j)\n-\t\t\t\tbest_x0[j] = x[j];\n-\t\t}\n-\t\ttot_F0 += F;\n-/*\n-\t\tif (all)\t// make this optional?\n-\t\t\tprintf("%d: %f\\n", i+1, F);\n-*/\n-\n-\t\t// Weir\n-\t\tif ((F = F_weir) > F1)\n-\t\t\t++larger1;\n-\t\tif (F > largest_F1) {\n-\t\t\tlargest_F1 = F;\n-\t\t\tfor (j = 0; j < nI; ++j)\n-\t\t\t\tbest_x1[j] = x[j];\n-\t\t}\n-\t\ttot_F1 += F;\n-\n-\t\t// Riech average\n-\t\tif ((F = F_reich) > F2)\n-\t\t\t++larger2;\n-\t\tif (F > largest_F2) {\n-\t\t\tlargest_F2 = F;\n-\t\t\tfor (j = 0; j < nI; ++j)\n-\t\t\t\tbest_x2[j] = x[j];\n-\t\t}\n-\t\ttot_F2 += F;\n-\n-\t\t// Reich population\n-\t\tif ((F = (N_reich/D_reich)) > F3)\n-\t\t\t++larger3;\n-\t\tif (F > largest_F3) {\n-\t\t\tlargest_F3 = F;\n-\t\t\tfor (j = 0; j < nI; ++j)\n-\t\t\t\tbest_x3[j] = x[j];\n-\t\t}\n-\t\ttot_F3 += F;\n-\t}\n-\tprintf("\\nOf %d random groupings:\\n", nshuff);\n-\tprintf("%d had a larger average Wright FST (max %5.5f, mean %5.5f)\\n",\n-\t larger0, largest_F0, tot_F0/nshuff);\n-\tif (largest_F0 > F0) {\n-\t\tprintf("first columns for the best two populations:\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x0[i] == 1)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tprintf("and\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x0[i] == 2)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tputchar(\'\\n\');\n-\t\tputchar(\'\\n\');\n-\t}\n-\tprintf("%d had a larger average Weir-Cockerham FST (max %5.5f, mean %5.5f)\\n",\n-\t larger1, largest_F1, tot_F1/nshuff);\n-\tif (largest_F1 > F1) {\n-\t\tprintf("first columns for the best two populations:\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x1[i] == 1)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tprintf("and\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x1[i] == 2)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tputchar(\'\\n\');\n-\t\tputchar(\'\\n\');\n-\t}\n-\tprintf("%d had a larger average Reich-Patterson FST (max %5.5f, mean %5.5f)\\n",\n-\t larger2, largest_F2, tot_F2/nshuff);\n-\tif (largest_F2 > F2) {\n-\t\tprintf("first columns for the best two populations:\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x2[i] == 1)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tprintf("and\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x2[i] == 2)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tputchar(\'\\n\');\n-\t\tputchar(\'\\n\');\n-\t}\n-\tprintf("%d had a larger Reich-Patterson population FST (max %5.5f, mean %5.5f)\\n",\n-\t larger3, largest_F3, tot_F3/nshuff);\n-\tif (largest_F3 > F3) {\n-\t\tprintf("first columns for the best two populations:\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x3[i] == 1)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tprintf("and\\n");\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (best_x3[i] == 2)\n-\t\t\t\tprintf("%d ", col[i]);\n-\t\tputchar(\'\\n\');\n-\t\tputchar(\'\\n\');\n-\t}\n-\n-\treturn 0;\n-}\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Fst_column.c --- a/genome_diversity/src/Fst_column.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,162 +0,0 @@ -/* Fst_column -- add an Fst column to a Galaxy table -* -* argv{1] = a Galaxy SNP table. For each of several individuals, the table -* has four columns (#A, #B, genotype, quality). -* argv[2] = 1 if Fst is estimated from SAMtools genotypes; 0 means use -* read-coverage data. -* argv[3] = lower bound for total number of reads per population -* argv[4] = lower bound for individual quality value -* argv[5] = 1 to retain SNPs that fail to satisfy the lower bound and set -* Fst = -1; delete them if argv[4] = 0. -* argv[6] = 1 to discard SNPs that appear fixed in the two populations -* argv[7] = 0 for the original Wright form, 1 for Weir, 2 for Reich -* argv[8], argv[9], ..., have the form "13:1" or "13:2", meaning that -* the 13th, 14th, and 15th columns (base 1) give the allele counts -* and genotype for an individual that is in population 1 or -* population 2, respectively. - -What It Does on Galaxy - -The user specifies a SNP table and two "populations" of individuals, both previously defined using the Galaxy tool to specify individuals from a SNP table. No individual can be in both populations. Other choices are as follows. - -Data source. The allele frequencies of a SNP in the two populations can be estimated either by the total number of reads of each allele, or by adding the frequencies inferred from genotypes of individuals in the populations. - -After specifying the data source, the user sets lower bounds on amount of data required at a SNP. For estimating the Fst using read counts, the bound is the minimum count of reads of the two alleles in a population. For estimations based on genotype, the bound is the minimum reported genotype quality per individual. - -The user specifies whether the SNPs that violate the lower bound should be ignored or the Fst set to -1. - -The user specifies whether SNPs where both populations appear to be fixed for the same allele should be retained or discarded. - -Finally, the user chooses which definition of Fst to use: Wright's original definition, the Weir-Cockerham unbiased estimator, or the Reich-Patterson estimator. - -A column is appended to the SNP table giving the Fst for each retained SNP. - -References: - -Sewall Wright (1951) The genetical structure of populations. Ann Eugen 15:323-354. - -B. S. Weir and C. Clark Cockerham (1984) Estimating F-statistics for the analysis of population structure. Evolution 38:1358-1370. - -Weir, B.S. 1996. Population substructure. Genetic data analysis II, pp. 161-173. Sinauer Associates, Sunderland, MA. - -David Reich, Kumarasamy Thangaraj, Nick Patterson, Alkes L. Price, and Lalji Singh (2009) Reconstructing Indian population history. Nature 461:489-494, especially Supplement 2. - -Their effectiveness for computing FSTs when there are many SNPs but few individuals is discussed in the followoing paper. - -Eva-Maria Willing, Christine Dreyer, Cock van Oosterhout (2012) Estimates of genetic differentiation measured by FST do not necessarily require large sample sizes when using many SNP markers. PLoS One 7:e42649. - -*/ - -#include "lib.h" -#include "Fst_lib.h" - -// most characters allowed in a row of the table -#define MOST 50000 - -// column and population for the relevant individuals/groups -int col[MOST], pop[MOST]; -int nI; - -int main(int argc, char **argv) { - FILE *fp; - char *p, *z = "\t\n", buf[MOST], trash[MOST]; - int X[MOST], min_cov, min_qual, retain, discard, unbiased, genotypes, - n, i, g, A1, B1, A2, B2, saw[3], x1, y1, x2, y2; - double F, N, D; - - if (argc < 7) - fatal("args: table data-source lower-bound retain? discard? unbiased? n:1 m:2 ..."); - genotypes = atoi(argv[2]); - min_cov = atoi(argv[3]); - min_qual = atoi(argv[4]); - retain = atoi(argv[5]); - discard = atoi(argv[6]); - unbiased = atoi(argv[7]); - saw[1] = saw[2] = 0; - for (i = 8; i < argc; ++i, ++nI) { - if (sscanf(argv[i], "%d:%d", &(col[nI]), &(pop[nI])) != 2) - fatalf("not like 13:2 : %s", argv[i]); - if (pop[nI] < 1 || pop[nI] > 2) - fatalf("not population 1 or 2: %s", argv[i]); - saw[pop[nI]] = 1; - // seen this individual before? - for (n = 0; n < nI && col[n] != col[nI]; ++n) - ; - if (n < nI) - fatalf("individual at column %d is mentioned twice", - col[n]); - } - if (saw[1] == 0) - fatal("population 1 is empty"); - if (saw[2] == 0) - fatal("population 2 is empty"); - - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - strcpy(trash, buf); - // set X[i] = atoi(i-th word of s), i is base 0 - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = A1 = B1 = A2 = B2 = x1 = y1 = x2 = y2 = 0; - i < nI; ++i) { - n = col[i]; - g = X[n+2]; // save genotype - - if (genotypes) { - if (g == -1) - continue; - } else if (X[n+3] < min_qual) - continue; - if (pop[i] == 1) { - // column n (base 1) corresponds to entry X[n] - x1 += X[n]; - y1 += X[n+1]; - if (genotypes) { - A1 += g; - B1 += (2 - g); - } else { - A1 += X[n]; - B1 += X[n+1]; - } - } else if (pop[i] == 2) { - x2 += X[n]; - y2 += X[n+1]; - if (genotypes) { - A2 += g; - B2 += (2 - g); - } else { - A2 += X[n]; - B2 += X[n+1]; - } - } - } - if (discard && ((A1 == 0 && A2 == 0) || (B1 == 0 && B2 == 0))) - continue; // not variable in the two populations - if (!genotypes && (x1+y1 < min_cov || x2+y2 < min_cov)) - F = -1.0; - else { - if (unbiased == 0) - wright(A1, A2, B1, B2, &N, &D); - else if (unbiased == 1) - weir(A1, A2, B1, B2, &N, &D); - else if (unbiased == 2) - reich(A1, A2, B1, B2, &N, &D); - else - fatal("impossible value of 'unbiased'"); - if (D == 0.0) - continue; // ignore these SNPs - else - F = N/D; - } - if (F == -1.0 && !retain) - continue; - if ((p = strchr(buf, '\n')) != NULL) - *p = '\0'; - printf("%s\t%5.4f\n", buf, F); - } - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Fst_lib.c --- a/genome_diversity/src/Fst_lib.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,73 +0,0 @@ -// This file contains three procedures for computing different variants of Fst. - -#include "Fst_lib.h" - -/* For each of wright, weir and reich, args 1 and 2 are counts for one -* allele in the two populations; args 3 and 4 are counts for the other allele. -* The numerator and denominator of the computed Fst are returned through -* args 5 and 6. -*/ - -void wright(int A1, int A2, int B1, int B2, double *N, double *D) { - double a1 = A1, a2 = A2, b1 = B1, b2 = B2, n1, n2, p1, p2; - - double - p, // frequency in the pooled population - H_ave, // average of HWE heterogosity in the two populations - H_all; // HWE heterozygosity in the pooled popuations - - n1 = a1+b1; - n2 = a2+b2; - if (n1 == 0.0 || n2 == 0.0) { - // let the calling program handle it - *N = *D = 0.0; - return; - } - p1 = a1/n1; - p2 = a2/n2; - H_ave = p1*(1.0 - p1) + p2*(1.0 - p2); - p = (p1 + p2)/2.0; - H_all = 2.0*p*(1.0 - p); - *N = H_all - H_ave; - *D = H_all; -} - -void weir(int A1, int A2, int B1, int B2, double *N, double *D) { - double a1 = A1, a2 = A2, b1 = B1, b2 = B2, n1, n2, p1, p2, - n_tot, p_bar, nc, MSP, MSG; - - n1 = a1+b1; - n2 = a2+b2; - if (n1 == 0.0 || n2 == 0.0) { - // let the calling program handle it - *N = *D = 0.0; - return; - } - n_tot = n1 + n2; - p1 = a1/n1; - p2 = a2/n2; - - MSG = (n1*p1*(1.0-p1) + n2*p2*(1.0-p2))/(n_tot-2.0); - p_bar = (n1*p1 + n2*p2)/n_tot; - MSP = n1*(p1-p_bar)*(p1-p_bar) + n2*(p2-p_bar)*(p2-p_bar); - nc = n_tot - (n1*n1 + n2*n2)/n_tot; - *N = MSP - MSG; - *D = MSP + (nc-1)*MSG; -} - -void reich(int A1, int A2, int B1, int B2, double *N, double *D) { - double a1 = A1, a2 = A2, b1 = B1, b2 = B2, n1, n2, h1, h2, x; - - n1 = a1+b1; - n2 = a2+b2; - if (n1<=1 || n2<=1) { - // let the calling program handle it - *N = *D = 0.0; - return; - } - h1 = (a1*(n1-a1)) / (n1*(n1-1)); - h2 = (a2*(n2-a2)) / (n2*(n2-1)); - x = a1/n1 - a2/n2; - *N = x*x - h1/n1 - h2/n2; - *D = *N + h1 + h2; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Fst_lib.h --- a/genome_diversity/src/Fst_lib.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,11 +0,0 @@ -// Fst_lib.h - -/* For each of wright, weir and reich, args 1 and 2 are counts for one -* allele in the two populations; args 3 and 4 are counts for the other allele. -* The numerator and denominator of the computed Fst are returned through -* args 5 and 6. -*/ - -void wright(int, int , int , int , double * , double * ); -void weir(int, int , int , int , double * , double * ); -void reich(int, int , int , int , double * , double * ); |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Huang.c --- a/genome_diversity/src/Huang.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,44 +0,0 @@ -// Find highest scoring intervals, as discussed in Huang.h. - -#include "lib.h" -#include "Huang.h" - -void Huang(double x[], int n) { - double Score, oldScore; - int v, L, i; - - top = 0; // don't use location 0, so as to follow Fig. 6 - for (Score = 0.0, v = 0; v < n; ++v) { - oldScore = Score; - Score += x[v]; - if (x[v] < 0) - continue; - if (top > 0 && R[top].Rpos == v-1) { - // add edge to top subpath - R[top].Rpos = v; - R[top].Rscore = Score; - } else { - // create a one-edge subpath - ++top; - if (top >= MAX_R) - fatal("In Haung(), top is too big"); - R[top].Lpos = v-1; - R[top].Lscore = oldScore; - R[top].Rpos = v; - R[top].Rscore = Score; - R[top].Lower = top-1; - while ((L = R[top].Lower) > 0 && - R[L].Lscore > R[top].Lscore) - R[top].Lower = R[L].Lower; - } - // merge subpaths - while (top > 1 && (L = R[top].Lower) > 0 && - R[L].Rscore <= R[top].Rscore) { - R[L].Rpos = R[top].Rpos; - R[L].Rscore = R[top].Rscore; - top = L; - } - } - for (i = 1; i <= top; ++i) - R[i].Score = R[i].Rscore - R[i].Lscore; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Huang.h --- a/genome_diversity/src/Huang.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,20 +0,0 @@ -/* Find intervals of highest total score, i.e., such that adding postions to -* either end will decrease the total. We use the method of Fig. 6 of the paper: -* Xiaoqiu Huang, Pavel Pevzner, Webb Miller (1994) Parametric recomputing in -* alignment graphs. Combinatorial Pattern Matching (Springer Lecture Notes in -* Computer Science, 807), 87-101. -* -* The input scores are in x[0], x[1], ..., x[n-1], but the output regions -* are in R[1], R[2], ..., R[top]. R[i].Score is the total score of the i-th -* (in order of position) positive-scoring interval of x, which consists of of -* x[R[i].Lpos + 1] to x[R[i].Rpos]. -*/ -#define MAX_R 5000000 - -struct region { // a consecutive (relative to the reference) run of SNPs - double Lscore, Rscore, Score; - int Lpos, Rpos, Lower; -} R[MAX_R]; -int top; - -void Huang(double *x, int n); |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/Makefile --- a/genome_diversity/src/Makefile Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,61 +0,0 @@ -CC = gcc -COPT = -O2 -CWARN = -W -Wall -CFLAGS = $(COPT) $(CWARN) -INSTALL_DIR = ../bin - -TARGETS = admix_prep aggregate coords2admix coverage dist_mat dpmix \ - eval2pct filter_snps Fst_ave Fst_column get_pi mk_Ji mt_pi sweep - -all: $(TARGETS) - -install: $(TARGETS) - if [ ! -d "$(INSTALL_DIR)" ]; then mkdir -p "$(INSTALL_DIR)"; fi - cp $(TARGETS) $(INSTALL_DIR) - -admix_prep: admix_prep.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -aggregate: aggregate.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -coords2admix: coords2admix.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -coverage: coverage.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -dist_mat: dist_mat.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -dpmix: dpmix.c lib.c - $(CC) $(CFLAGS) $^ -lm -o $@ - -eval2pct: eval2pct.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -filter_snps: filter_snps.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -Fst_ave: Fst_ave.c Fst_lib.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -Fst_column: Fst_column.c Fst_lib.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -get_pi: get_pi.c lib.c - $(CC) $(CFLAGS) $^ -o $@ - -mk_Ji: mk_Ji.c lib.c mito_lib.c - $(CC) $(CWARN) $^ -o $@ - -mt_pi: mt_pi.c lib.c mito_lib.c - $(CC) $(CWARN) $^ -o $@ - -sweep: sweep.c lib.c Huang.c - $(CC) $(CFLAGS) $^ -o $@ - -.PHONY: clean - -clean: - rm -f $(TARGETS) |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/admix_prep.c --- a/genome_diversity/src/admix_prep.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,119 +0,0 @@ -/* admix_prep -- prepare the ".ped" and ".map" files (PLINK format) for input to -* the "admixture" program. -* -* argv[1] -- a Galaxy SNP table -* argv[2] -- required number of reads for each individual to use a SNP -* argv[3] -- required genotype quality for each individual to use a SNP -* argv[4] -- minimum spacing between SNPs on the same scaffold -* argv[k] for k > 4 have the form "13:fred", meaning that the 13th and 14th -* columns (base 0) give the allele counts for the individual or group named -* "fred". - -What it does on Galaxy -The tool converts a SNP table into two tables, called "admix.map" and "admix.ped", needed for estimating the population structure. The user can read or download those files, or simply pass this tool's output on to other programs. The user imposes conditions on which SNPs to consider, such as the minimum coverage and/or quality value for every individual, or the distance to the closest SNP in the same contig (as named in the first column of the SNP table). A useful piece of information produced by the tool is the number of SNPs meeting those conditions, which can be found by clicking on the "eye" after the program runs. - -*/ - -#include "lib.h" - -// bounds line length for a line of the Galaxy table -#define MOST 50000 -struct individual { - int column; - char *name; -} I[MOST/8]; // each individual has 4 columns and 4 tab characters -int nI; // number of individuals -int X[MOST]; // integer values in a row of the SNP table - -// bounds the number of SNPs that can be kept -#define MAX_KEEP 10000000 -char *S[MAX_KEEP]; // S[i] is a row of 2*nI alleles -int nK; - -int main(int argc, char **argv) { - FILE *fp, *ped, *map; - char *p, *z = " \t\n", buf[MOST], trash[MOST], name[100], *s, - scaf[100], prev_scaf[100]; - int i, j, m, min_coverage, min_quality, min_space, nsnp, genotype, - pos, prev_pos; - - if (argc < 5) - fatal("args: Galaxy-table min-cov min-qual min-space 13:fred 16:mary ..."); - min_coverage = atoi(argv[2]); - min_quality = atoi(argv[3]); - min_space = atoi(argv[4]); - - for (i = 5; i < argc; ++i, ++nI) { - if (nI >= MOST/8) - fatal("Too many individuals"); - if (sscanf(argv[i], "%d:%s", &(I[nI].column), name) != 2) - fatalf("bad arg: %s", argv[i]); - I[nI].name = copy_string(name); - } - - map = ckopen("admix.map", "w"); - - fp = ckopen(argv[1], "r"); - prev_scaf[0] = '\0'; - prev_pos = 0; - for (nsnp = 0; fgets(buf, MOST, fp); ) { - if (buf[0] == '#') - continue; - ++nsnp; - if (sscanf(buf, "%s %d", scaf, &pos) != 2) - fatalf("choke: %s", buf); - if (same_string(scaf, prev_scaf)) { - if (pos < prev_pos + min_space) - continue; - } else { - strcpy(prev_scaf, scaf); - prev_pos = -min_space; - } - - // X[i] = atoi(i-th word base-1) - strcpy(trash, buf); - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = 0; i < nI; ++i) { - m = I[i].column; - if (X[m] + X[m+1] < min_coverage || X[m+3] < min_quality) - break; - } - if (i < nI) - continue; - prev_pos = pos; - - if (nK >= MAX_KEEP) - fatal("Too many SNPs"); - fprintf(map, "1 snp%d 0 %d\n", nsnp, nsnp+1); - s = S[nK++] = ckalloc(2*nI*sizeof(char)); - for (i = j = 0; i < nI; ++i, j += 2) { - genotype = X[I[i].column+2]; - if (genotype == 2) - s[j] = s[j+1] = '1'; - else if (genotype == 0) - s[j] = s[j+1] = '2'; - else if (genotype == 1) { - s[j] = '1'; - s[j+1] = '2'; - } else // undefined genotype - s[j] = s[j+1] = '0'; - } - } - - fclose(map); - - ped = ckopen("admix.ped", "w"); - for (i = 0; i < nI; ++i) { - fprintf(ped, "%s 1 0 0 1 1", I[i].name); - for (j = 0; j < nK; ++j) - fprintf(ped, " %c %c", S[j][2*i], S[j][2*i+1]); - putc('\n', ped); - } - - printf("Using %d of %d SNPs\n", nK, nsnp); - fclose(ped); - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/aggregate.c --- a/genome_diversity/src/aggregate.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,69 +0,0 @@ -/* aggregate -- add four columns (allele counts, genotype, maximum quality) for -* a specified population to a Galaxy SNP table -* -* argv[1] = file containing a Galaxy table -* argv[2] = 0 for a gd_genotype file, 1 for a gd_snp file -* argv[3] ... are the starting columns (base-1) for the chosen individuals - -What it does on Galaxy -The user specifies that some of the individuals in a gd_snp or gd_genotype dataset form a "population", by supplying a list that has been previously created using the Specify Individuals tool. The program appends a new "entity" (set of four columns for a gd_snp table, or one column for a gd_genotype table), analogous to the column(s) for an individual but containing summary data for the population as a group. For a gd_snp table, these four columns give the total counts for the two alleles, the "genotype" for the population, and the maximum quality value, taken over all individuals in the population. If all defined genotypes in the population are 2 (agree with the reference), then the population's genotype is 2, and similarly for 0; otherwise the genotype is 1 (unless all individuals have undefined genotype, in which case it is -1). For a gd_genotype file, only the aggregate genotype is appended. -*/ - -#include "lib.h" - -// most characters allowed in a row of the table -#define MOST 50000 - -// column for the relevant individuals/groups -int col[MOST]; -int nI; - -int main(int argc, char **argv) { - FILE *fp; - char *p, *z = "\t\n", buf[MOST], trash[MOST]; - int X[MOST], m, i, A, B, G, Q, g, gd_snp; - - if (argc < 3) - fatalf("args: SNP-table typedef individual1 ..."); - - gd_snp = atoi(argv[2]); - for (i = 3, nI = 0; i < argc; ++i, ++nI) - col[nI] = atoi(argv[i]); - - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - strcpy(trash, buf); - // set X[i] = atoi(i-th word of s), i is base 0 - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = A = B = Q = 0, G = -1; i < nI; ++i) { - m = col[i]; - if (gd_snp) { - A += X[m]; - B += X[m+1]; - Q = MAX(Q, X[m+3]); - } - g = X[m+2]; - if (g != -1) { - if (G == -1) // first time - G = g; - else if (G != g) - G = 1; - } - } - if (i < nI) // check bounds on the population's individuals - continue; - // add columns - if ((p = strchr(buf, '\n')) != NULL) - *p = '\0'; - if (gd_snp) - printf("%s\t%d\t%d\t%d\t%d\n", buf, A, B, G, Q); - else - printf("%s\t%d\n", buf, G); - } - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/coords2admix.c --- a/genome_diversity/src/coords2admix.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,91 +0,0 @@ -// coords2admix -- add projections onto chords to information about -// coordinates in PCA plots - -#include "lib.h" - -#define MAX_POP 1000 -struct pop { - char *name; - float x, y; -} P[MAX_POP]; -int nP; - -int main(int argc, char **argv) { - FILE *fp; - char buf[500], x[100], y[100], z[100], cur_pop[100]; - int ncur, i, j, k; - float eig1, eig2, tot_x = 0.0, tot_y = 0.0, x1, y1, x2, y2, a, b, c, d; - - if (argc == 1) - fp = stdin; - else if (argc == 2) - fp = ckopen(argv[1], "r"); - else - fatal("optional arg: smartpca coordinates"); - - if (!fgets(buf, 500, fp)) - fatal("empty set of coordinates"); - if (sscanf(buf, "%s %s %s", x, y, z) != 3 || - !same_string(x, "#eigvals:")) - fatalf("cannot find eigenvalues: %s", buf); - printf("%s", buf); - eig1 = atof(y); - eig2 = atof(z); - //printf("eig1 = %f, eig2 = %f\n", eig1, eig2); - - strcpy(cur_pop, ""); - ncur = 0; - while (fgets(buf, 500, fp)) { - if (sscanf(buf, "%*s %s %s %s", x, y, z) != 3) - fatalf("gag: %s", buf); - printf("%s", buf); - if (!same_string(cur_pop, z)) { - if (ncur > 0) { - P[nP].name = copy_string(cur_pop); - P[nP].x = tot_x/ncur; - P[nP].y = tot_y/ncur; - ++nP; - } - ncur = 1; - strcpy(cur_pop, z); - tot_x = atof(x); - tot_y = atof(y); - } else { - ++ncur; - tot_x += atof(x); - tot_y += atof(y); - } - } - P[nP].name = copy_string(cur_pop); - P[nP].x = tot_x/ncur; - P[nP].y = tot_y/ncur; - ++nP; - -/* -for (i = 0; i < nP; ++i) -printf("%s %f %f\n", P[i].name, P[i].x, P[i].y); -*/ - - // loop over pairs of populations - for (i = 0; i < nP; ++i) { - x1 = eig1*P[i].x; - y1 = eig2*P[i].y; - for (j = i+1; j < nP; ++j) { - printf("\nprojection along chord %s -> %s\n", - P[i].name, P[j].name); - x2 = eig1*P[j].x; - y2 = eig2*P[j].y; - c = (x1-x2)*(x1-x2) + (y1-y2)*(y1-y2); - for (k = 0; k < nP; ++k) - if (k != i && k != j) { - a = eig1*P[k].x; - b = eig2*P[k].y; - d = (x2-x1)*(a-x1) + (y2-y1)*(b-y1); - printf(" %s: %f\n", P[k].name, d/c); - } - } - } - - return 0; -} - |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/coverage.c --- a/genome_diversity/src/coverage.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,155 +0,0 @@ -/* coverage -- report distributions of SNP coverage or quality for individuals, -* or coverage for populations -* -* argv{1] -- a Galaxy SNP table. For each individuals, the table has four -* columns (count of each allele, genotype, quality). -* argv[2] -- 0 = sequence coverage, 1 = genotype quality -* argv[3] -- file name for the text version of output (input for producing -* the graphical summary goes to stdout) -* argv[4], argv[5], ..., have the form "13:fred", meaning that the 13th -* 14th, and 16th columns (base 1) give the two allele counts -* and the quality for "fred", where "fred" can be the name of -* a population with several individuals (all named "fred") -What it does on Galaxy -The tool reports distributions of SNP reliability indicators for individuals or populations. The reliability can be measured by either the sequence coverage or the SAMtools quality value, though the notion of a population-level quality is not supported. Textual and graphical reports are generated, where the text output gives the cumulative distributions. -*/ - -#include "lib.h" - -// maximum length of a line from the table -#define MOST 50000 - -// the largest coverage or quality value being considered -#define MAX_VAL 5000 - -FILE *gp; // for text output - -// a population is the set of all indivuals with the same name -// (perhaps just a single individual) -struct pop { - int cov, n[MAX_VAL+1]; - long long sum, tot; - char *name; -} P[MOST/4]; -int nP; // number of populations - -// maps column to population -struct individual { - int col, pop; -} I[MOST/4]; -int nI; - -/* Report the distribution for each individual. P[i].n[k] is the number of SNPs -* of value (coverage or quality) k in population i, for k < MAX_VAL; -* I[i].n[MAX_VAL] is the number of SNPs of value k >= MAX_VAL. -* We print the percentages, p, of SNPs with value <= k, ending when all -* populations have reached a p >= 98%. -*/ -void print_cov() { - int i, j, k, last_j; - long long sum; - - // find where to stop printing - for (last_j = i = 0; i < nP; ++i) { - for (sum = j = 0; j <= MAX_VAL; ++j) - sum += P[i].n[j]; - P[i].tot = sum; - for (sum = j = 0; j <= MAX_VAL; ++j) { - sum += P[i].n[j]; - if (sum >= 0.98*P[i].tot) - break; - } - last_j = MAX(last_j, j); - } - - - ++last_j; - // print to stdout the output for graphing; not broken into short lines - for (j = 0; j < last_j; ++j) - printf("\t%3d", j); - putchar('\n'); - for (i = 0; i < nP; ++i) { - printf("%s", P[i].name); - for (sum = j = 0; j < last_j; ++j) { - sum += P[i].n[j]; - printf("\t%4.2f", 100.0*(float)sum/(float)P[i].tot); - } - putchar('\n'); - } - - // print a user-friendly version to the named file - // <= 20 numbers per row - for (j = 0; j < last_j; j += 20) { - fprintf(gp, "\n "); - for (k = j; k < MIN(j+20, last_j); ++k) - fprintf(gp, "%3d", k); - for (i = 0; i < nP; ++i) { - fprintf(gp, "\n%10s", P[i].name); - for (k = j; k < MIN(j+20, last_j); ++k) { - P[i].sum += P[i].n[k]; - fprintf(gp, "%3lld", - MIN(99, 100*P[i].sum/P[i].tot)); - } - } - fprintf(gp,"\n\n"); - } -} - -int main(int argc, char **argv) { - FILE *fp; - char buf[MOST], *z = " \t\n", *p; - int X[MOST], i, j, cov, m, quality, is_pop; - - if (argc < 5) - fatal("args: SNP-file quality-value? out-name 13:fred ... "); - quality = atoi(argv[2]); - gp = ckopen(argv[3], "w"); - // record the individuals and populations - for (nI = 0, i = 4; i < argc; ++i, ++nI) { - if (nI >= MOST) - fatal("Too many individuals"); - // allow spaces in names - if ((p = strchr(argv[i], ':')) == NULL) - fatalf("no colon: %s", argv[i]); - I[nI].col = atoi(argv[i]); - for (j = 0; j < nP && !same_string(p+1, P[j].name); ++j) - ; - if (j == nP) { // new population - is_pop = 1; - P[nP++].name = copy_string(p+1); - } - I[nI].pop = j; - } - if (is_pop && quality) - fatal("quality values for a population are not supported."); - - // Record the number of SNPs with coverage 0, 1, ..., MAX_VAL-1, - // or >= MAX_VAL for each individual. - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - // P[i].cov is the total coverage for all individuals in pop i - for (i = 0; i < nP; ++i) - P[i].cov = 0; - // X[i] = atoi(i-th word base-1) - for (i = 1, p = strtok(buf, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = 0; i < nI; ++i) { - m = I[i].col; - if (quality) - cov = X[m+3]; - else - cov = X[m] + X[m+1]; - P[I[i].pop].cov += cov; - } - for (i = 0; i < nP; ++i) - P[i].n[MIN(P[i].cov, MAX_VAL)]++; - } - - // Print the distributions. - print_cov(); - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/dist_mat.c --- a/genome_diversity/src/dist_mat.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,196 +0,0 @@ -/* dist_mat -- create a distance matrix in PHYLIP format for pairs of -* specified individuals, including by default the reference sequence -* -* argv[1] -- a Galaxy SNP table -* argv[2] -- min coverage -* argv[3] -- min quality -* argv[4] -- name of reference species (or "none") -* argv[5] -- 0 = distance from coverage; 1 = distance from genotype -* argv[6] -- name of file for the numbers of informative SNPs -* argv[7] -- name of file to write the Mega-format distance matrix -* argv[k] for k > 7 have the form "13:fred", meaning that the 13th and 14th -* columns (base 0) give the allele counts for the individual or group named -* "fred". - -What it does on Galaxy -This tool uses the selected SNP table to determine a "genetic distance" between each pair of selected individuals; the table of pairwise distances can be used by the Neighbor-Joining methods to construct a tree that depicts how the individuals are related. For a given pair of individuals, we find all SNP positions where both individuals have at least a minimum number of sequence "reads"; the individuals' distance at that SNP is defined as the absolute value of difference in the frequency of the first allele (equivalently: the second allele). For instance, if the first individuals has 5 reads of each allele and the second individual has respectivley 3 and 6 reads, then the frequencies are 1/2 and 1/3, giving a distance 1/6 at that SNP (provided that the minimum read total is at most 9). The output includes a report of the numbers of SNPs passing that thresold for each pair of individuals. - -*/ - -#include "lib.h" - -// bounds line length for a line of the Galaxy table - -#define MOST 50000 -#define MIN_SNPS 3 - -struct argument { - int column; - char *name; -} A[MOST]; -int nA; // number of individuals or groups + 1 (for the reference species) - -#define MOST_INDIVIDUALS 1000 -#define SIZ 1+MOST_INDIVIDUALS // includes the reference - -double tot_diff[SIZ][SIZ]; -int ndiff[SIZ][SIZ], X[MOST]; - -int main(int argc, char **argv) { - FILE *fp, *gp, *mega; - char *p, *z = "\t\n", buf[MOST], name[100], B[100], C[100], D[100], - *nucs = "ACGT"; - int i, j, m, n, min_coverage, too_few, ref_allele = -1, has_ref, - min_quality, genotype; - double fi, fj, dist; - - if (argc < 8) - fatal("args: Galaxy-table min-cov min-qual min-snp ref-name genotype dist-out mega-out 13:fred 16:mary ..."); - min_coverage = atoi(argv[2]); - min_quality = atoi(argv[3]); - genotype = atoi(argv[5]); - if (!genotype && min_coverage <= 0 && min_quality <= 0) - fatal("coverage and/or quality of SNPs should be constrained"); - - if (same_string(argv[4], "none")) - has_ref = 0; - else { - has_ref = 1; - A[0].name = copy_string(argv[4]); - } - gp = ckopen(argv[6], "w"); - mega = ckopen(argv[7], "w"); - fprintf(mega, "#mega\n!Title: Galaxy;\n"); - - for (nA = has_ref, i = 8; i < argc; ++i, ++nA) { - if (nA >= SIZ) - fatal("Too many individuals"); - if (sscanf(argv[i], "%d:%s", &(A[nA].column), name) != 2) - fatalf("bad arg: %s", argv[i]); - A[nA].name = copy_string(name); - } - fprintf(mega, - "!Format DataType=Distance DataFormat=LowerLeft NTaxa=%d;\n\n", - nA); - for (i = 0; i < nA; ++i) - fprintf(mega, "[%d] #%s\n", i+1, A[i].name); - fprintf(mega, "\n\n\n["); - for (i = 1; i <= nA; ++i) - fprintf(mega, "%4d", i); - fprintf(mega, " ]\n"); - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - if (has_ref) { - // get the reference allele - if (sscanf(buf, "%*s %*s %s %s %*s %*s %*s %s", B, C, D) - != 3) - fatalf("3 fields: %s", buf); - if (strchr(nucs, B[0]) == NULL || - strchr(nucs, C[0]) == NULL) - fatalf("not nucs : %s %s", B, C); - if (D[0] == B[0]) - ref_allele = 1; - else if (D[0] == C[0]) - ref_allele = 2; - else if (strchr(nucs, D[0]) != NULL) - ref_allele = 3; - else { - if (D[0] != '-' && D[0] != 'N') - fatalf("what is this: %s", D); - ref_allele = -1; - } - } - - // X[i] = atoi(i-th word base-1) - for (i = 1, p = strtok(buf, z); p != NULL; - ++i, p = strtok(NULL, z)) - X[i] = atoi(p); - for (i = has_ref; i < nA; ++i) { - m = A[i].column; - if (X[m] + X[m+1] < min_coverage || - X[m+3] < min_quality) - continue; - - // frequency of the second allele - if (genotype) { - if (X[m+2] == -1) - continue; // no genotype - fi = (double)X[m+2]; - } else - fi = (double)X[m+1] / (double)(X[m]+X[m+1]); - if (has_ref && ref_allele > 0) { - ndiff[0][i]++; - // reference allele might be different from both - if (ref_allele == 1) - tot_diff[0][i] += fi; - else if (ref_allele == 2) - tot_diff[0][i] += (1.0 - fi); - else - tot_diff[0][i] += 1.0; - } - for (j = i+1; j < nA; ++j) { - n = A[j].column; - if (X[n] + X[n+1] < min_coverage || - X[n+3] < min_quality) - continue; - if (genotype && X[n+2] == -1) - continue; - ndiff[i][j]++; - if (genotype) - fj = (double)X[n+2]; - else - fj = (double)X[n+1] / - (double)(X[n] + X[n+1]); - fj -= fi; - // add abs. value of difference in frequencies - tot_diff[i][j] += (fj >= 0.0 ? fj : -fj); - } - - } - } - for (i = too_few = 0; i < nA; ++i) - for (j = i+1; j < nA; ++j) - if (ndiff[i][j] < MIN_SNPS) { - too_few = 1; - fprintf(stderr, - "%s and %s have only %d informative SNPs\n", - A[i].name, A[j].name, ndiff[i][j]); - } - if (too_few) - fatal("remove individuals or relax constraints"); - - // print distances - printf("%d\n", nA); - for (i = 0; i < nA; ++i) { - printf("%9s", A[i].name); - fprintf(mega, "[%d] ", i+1); - for (j = 0; j < i; ++j) { - dist = tot_diff[j][i]/(double)ndiff[j][i]; - printf(" %6.4f", dist); - fprintf(mega, " %6.4f", dist); - } - fprintf(mega, " \n"); - printf(" 0.0000"); - for (j = i+1; j < nA; ++j) - printf(" %6.4f", - tot_diff[i][j]/(double)ndiff[i][j]); - putchar('\n'); - } - fprintf(mega, "\n\n\n\n\n"); - fclose(mega); - - // print numbers of SNPs - for (i = 0; i < nA; ++i) { - fprintf(gp, "%9s", A[i].name); - for (j = 0; j < i; ++j) - fprintf(gp, " %8d", ndiff[j][i]); - fprintf(gp, " 0"); - for (j = i+1; j < nA; ++j) - fprintf(gp," %8d", ndiff[i][j]); - putc('\n', gp); - } - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/dpmix.c --- a/genome_diversity/src/dpmix.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
b'@@ -1,641 +0,0 @@\n-/* dpmix -- admixture using dynamic programming; 2 or 3 source populations\n-*\n-* argv{1] = a Galaxy SNP table. For each individual, the table may have\n-* four columns (#A, #B, genotype, quality), or only one\n-*\t (genotype). SNPs on the same chromosome must appear together,\n-*\t and in order of position\n-* argv[2] = column with the chromosome name (position is the next column)\n-* argv[3] = "all" or e.g., "chr20"\n-* argv[4] = 1 if source-pop allele frequencies are estimated from genotypes;\n-*\t 0 means use read-coverage data.\n-* argv[5] = 1 to add logarithms of probabilities;\n-*\t 0 to simply add probabilities\n-* argv[6] = switch penalty (>= 0)\n-* argv[7] = file giving heterochromatic intervals (\'-\' = no file is given)\n-* argv[8] = file name for additional output\n-* argv[9], argv[10], ... = like "13:1:Peter", "13:2:Paul", "13:3:Sam"\n-*\t or "13:0:Mary", meaning that the 13th and 14th columns (base 1)\n-*\t give the allele counts for an individual that is in source\n-*\t population 1, source population 2, source population 3,\n-*\t or is a potentially admixed individual, resp. Even when only the\n-*\t genotype is given, it is found in column 15.\n-* optional last arguments = "debug"\n-\n-What it does on Galaxy\n-The user specifies two or three source populations (i.e., sources for chromosomes) and a set of potentially admixed individuals, and chooses between the sequence coverage or the estimated genotypes to measure the similarity of genomic intervals in admixed individuals to the source chromosomes. The user also specifies a "switch penalty", controlling the strength of evidence needed to switch between source populations as the the program scans along a chromosome.\n-\n-Choice of an appropriate switch penalty depends on the number of SNPs and the time since the admixture event(s), and it is not always easy to determine a good setting ahead of time. One approach is to try, say, 10.0, and look at the resulting picture. If the admixture blocks are shorter than anticipated, based on expectations about the number of generations since the admixture event(s) (i.e., the implied admixture events are too ancient), then try a larger value. Conversely, if the blocks are longer than anticipated (i.e., the implied timing of admixture is too recent), then lower the penalty. In any case, it is prudent to base conclusions on results consistently obtained with a range of switch penalities.\n-\n-If there are 3 source populatons, then for each potentially admixed individual the program divides each potentially admixed genome into six "genotypes":\n-\n-(1) homozygous for the first source population (i.e., both chromosomes from that population),\n-(2) homozygous for the second source population,\n-(3) homozygous for the third source population,\n-(4) heterozygous for the first and second populations (i.e., one chromosome from each),\n-(5) heterozygous for the first and third populations, or\n-(6) heterozygous for the second and third populations.\n-\n-Parts of a reference chromosome that are labeled as "heterochromatic" are given the "non-genotype" 0. With two source populations, there are only three possible "genotypes".\n-\n-There are two Galaxy history-items generated. The first is a tabular dataset with chromosome, start, stop, and pairs of columns containing the "genotypes" as described above and the label from the admixed individual. It is used to draw a pciture and can be consulted for details. The second item is a composite of (1) a link to a pdf which graphically shows the inferred source population for each potentially admixed individual along each chromosome, and (2) a link to a text file that describes the run and summarizes the extent of predicted admixture in each individual.\n-\n-For certain genome assemblies, Galaxy stores a "heterochromatin file", which specifies the heterochromatic intervals on each chromosome, as well as telling the chromosome le'..b';\n-\t\t\t// assumes last event per chrom. is a het. interval\n-\t\t\thet_len += (j - i);\n-\t\t\t++nH;\n-\t\t}\n-\t\tfclose(fp);\n-\t}\n-\tref_len += H[nH-1].e;\n-\n-\t// populations must be disjoint\n-\tsaw[0] = saw[1] = saw[2] = saw[3] = 0;\n-\tfor (i = 9; i < argc; ++i) {\n-\t\tif (sscanf(argv[i], "%d:%d:%s", &j, &k, nam) != 3)\n-\t\t\tfatalf("not like 13:2:fred : %s", argv[i]);\n-\t\tif (k < 0 || k > 3)\n-\t\t\tfatalf("not population 0, 1, 2 or 3: %s", argv[i]);\n-\t\tsaw[k] = 1;\n-\n-\t\t// seen this individual (i.e., column) before??\n-\t\tfor (n = 0; n < nI && C[n].col != j; ++n)\n-\t\t\t;\n-\t\tif (n < nI)\n-\t\t\tfatal("populations are not disjoint");\n-\t\tif (k == 0) {\t// admixed individual\n-\t\t\tif (nG >= MOST)\n-\t\t\t\tfatal("Too many admixed individuals");\n-\t\t\tA[nG].name = copy_string(nam);\n-\t\t\tA[nG++].gcol = j+2;\n-\t\t} else {\t// in a source population\n-\t\t\tif (nI >= MOST)\n-\t\t\t\tfatal("Too many ancestral individuals");\n-\t\t\tC[nI].col = j;\n-\t\t\tC[nI].pop = k;\n-\t\t\tC[nI++].name = copy_string(nam);\n-\t\t}\n-\t}\n-\tif (saw[0] == 0)\n-\t\tfatal("no admixed individual is specified");\n-\tif (saw[1] == 0)\n-\t\tfatal("first reference population is empty");\n-\tif (saw[2] == 0)\n-\t\tfatal("second reference population is empty");\n-\tpop3 = saw[3];\n-\n-\t// start the output file of text\n-\tfor (k = 1; k <= 3; ++k) {\n-\t\tif (k == 3 && !pop3)\n-\t\t\tbreak;\n-\t\tfprintf(out, "source population %d (state %d):", k, k);\n-\t\tfor (i = 0; i < nI; ++i)\n-\t\t\tif (C[i].pop == k)\n-\t\t\t\tfprintf(out, " %s", C[i].name);\n-\t\tfprintf(out, "\\n\\n");\n-\t}\n-\tif (pop3) {\n-\t\tfprintf(out, "state 4 is heterozygous for populations 1 and 2\\n");\n-\t\tfprintf(out,\n-\t\t "state 5 is heterozygous for populations 1 and 3\\n");\n-\t\tfprintf(out,\n-\t\t "state 6 is heterozygous for populations 2 and 3\\n");\n-\t} else\n-\t\tfprintf(out, "state 3 is heterozygous for populations 1 and 2\\n");\n-\tfprintf(out, "\\nswitch penalty = %2.2f\\n", switch_penalty);\n-\tputc(\'\\n\', out);\n-\n-\tfp = ckopen(argv[1], "r");\n-\twhile ((status = fgets(buf, MOST, fp)) != NULL && buf[0] == \'#\')\n-\t\t;\n-\tif (same_string(chr, "all"))\n-\t\twhile (status != NULL)\n-\t\t\tone_chr();\n-\telse {\t// skip to the specified chromosome\n-\t\twhile (!same_string(chr, get_chr_name()) &&\n-\t\t (status = fgets(buf, MOST, fp)) != NULL)\n-\t\t\t;\n-\t\tif (status != NULL)\n-\t\t\tone_chr();\n-\t}\n-\n-\tif (ref_len)\n-\t\tfprintf(out,\n-\t\t "%lld of %lld reference bp (%1.1f%%) are heterochromatin\\n\\n",\n-\t\t het_len, ref_len, 100.0*(float)het_len/(float)ref_len);\n-\n-\t// write fractions in each state to the output text file\n-\tfor (j = 0; j < 7; ++j)\n-\t\ttot[j] = 0.0;\n-\tfprintf(out, "individual:");\n-\tfprintf(out, "\\tstate 1\\tstate 2\\tstate 3");\n-\tif (pop3)\n-\t\tfprintf(out, "\\tstate 4\\tstate 5\\tstate 6");\n-\tfprintf(out, "\\t pop 1\\t pop 2");\n-\tif (pop3)\n-\t\tfprintf(out, "\\t pop 3");\n-\tputc(\'\\n\', out);\n-\tfor (i = 0; i < nG; ++i) {\n-\t\tN = A[i].x[1] + A[i].x[2] + A[i].x[4];\n-\t\tif (pop3)\n-\t\t\tN += A[i].x[3] + A[i].x[5] + A[i].x[6];\n-\t\tN /= 100.0;\n-\t\tfprintf(out, "%s:", A[i].name);\n-\t\tif (strlen(A[i].name) < 7)\n-\t\t\tputc(\'\\t\', out);\n-\t\tfor (j = 1; j < 7; ++j)\n-\t\t\tif (pop3 || j == 1 || j == 2 || j == 4) {\n-\t\t\t\ttot[j] += (keep[j] = A[i].x[j]);\n-\t\t\t\tfprintf(out, "\\t %5.1f%%", keep[j]/N);\n-\t\t\t}\n-\t\tkeep[1] += 0.5*keep[4];\n-\t\tkeep[2] += 0.5*keep[4];\n-\t\tif (pop3) {\n-\t\t\tkeep[1] += 0.5*keep[5];\n-\t\t\tkeep[2] += 0.5*keep[6];\n-\t\t\tkeep[3] += 0.5*(keep[5]+keep[6]);\n-\t\t}\n-\n-\t\tfprintf(out, "\\t %5.1f%%\\t %5.1f%%", keep[1]/N, keep[2]/N);\n-\t\tif (pop3)\n-\t\t\tfprintf(out, "\\t %5.1f%%", keep[3]/N);\n-\t\t\t\n-\t\tputc(\'\\n\', out);\n-\t}\n-\tif (nG > 1) {\n-\t\tfprintf(out, "\\naverage: ");\n-\t\tN = tot[1] + tot[2] + tot[4];\n-\t\tif (pop3)\n-\t\t\tN += (tot[3] + tot[5] + tot[6]);\n-\t\tN /= 100.0;\n-\t\tfor (j = 1; j < 7; ++j) {\n-\t\t\tif (pop3 || j == 1 || j == 2 || j == 4)\n-\t\t\t\tfprintf(out, "\\t %5.1f%%", tot[j]/N);\n-\t\t}\n-\t\txx = tot[1] + 0.5*tot[4];\n-\t\tyy = tot[2] + 0.5*tot[4];\n-\t\tif (pop3) {\n-\t\t\txx += 0.5*tot[5];\n-\t\t\tyy += 0.5*tot[6];\n-\t\t}\n-\t\tfprintf(out, "\\t %5.1f%%\\t %5.1f%%", xx/N, yy/N);\n-\t\tif (pop3)\n-\t\t\tfprintf(out, "\\t %5.1f%%",\n-\t\t\t (tot[3] + 0.5*tot[5] + 0.5*tot[6])/N);\n-\t\tputc(\'\\n\', out);\n-\t}\n-\n-\treturn 0;\n-}\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/eval2pct.c --- a/genome_diversity/src/eval2pct.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,26 +0,0 @@ -#include "lib.h" - -#define MAX_EVAL 1000 - -float E[MAX_EVAL]; -int nE; - -int main (int argc, char **argv) { - FILE *fp; - char buf[500]; - int i; - float tot; - - fp = (argc== 1 ? stdin : ckopen(argv[1], "r")); - while (fgets(buf, 500, fp)) { - if (nE >= MAX_EVAL) - fatal("Too many eigenvalues"); - E[nE++] = atof(buf); - } - for (tot = 0.0, i = 0; i < nE; ++i) - tot += E[i]; - printf("Percentage explained by eigenvectors:\n"); - for (i = 0 ; i < nE && E[i] > 0.0; ++i) - printf("%d: %1.1f%%\n", i+1, 100.0*(float)E[i]/tot); - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/filter_snps.c --- a/genome_diversity/src/filter_snps.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,165 +0,0 @@ -/* filter_snps -- enforce constraints on a file of SNPs. -* -* argv[1] = file containing a Galaxy table -* argv[2] = 1 for a gd_snp file, 0 for gd_genotype -* argv[3] = lower bound on total coverage (< 0 means interpret as percentage) -* argv[4] = upper bound on total coveraae (< 0 means interpret as percentage; -* 0 means no bound) -* argv[5] = lower bound on individual coverage -* argv[6] = lower bound on individual quality value -* argv[7] = lower bound on the number of defined genotypes -* argv[8] = lower bound on the spacing between SNPs -* argv[9] = reference-chromosome column (base-1); ref position in next column -* If argv[8] == 0, then argv[7] must be 0. -* argv[10] ... are the starting columns (base-1) for the chosen individuals. -* If argc == 10, then only the lower bound on spacing between SNPs is -* enforced. - -What it does on Galaxy -For a gd_snp dataset, the user specifies that some of the individuals form a "population", by supplying a list that has been previously created using the Specify Individuals tool. SNPs are then discarded if their total coverage for the population is too low or too high, or if their coverage or quality score for any individual in the population is too low. - -The upper and lower bounds on total population coverage can be specified either as read counts or as percentiles (e.g. "5%", with no decimal places). For percentile bounds the SNPs are ranked by read count, so for example, a lower bound of "10%" means that the least-covered 10% of the SNPs will be discarded, while an upper bound of, say, "80%" will discard all SNPs above the 80% mark, i.e. the top 20%. The threshold for the lower bound on individual coverage can only be specified as a plain read count. - -For either a gd_snp or gd_genotype dataset, the user can specify a minimum number of defined genotypes (i.e., not -1) and/or a minimum spacing relative to the reference sequence. An error is reported if the user requests a minimum spacing but no reference sequence is available. - -*/ - -#include "lib.h" - -// most characters allowed in a row of the table -#define MOST 50000 - -// larger than any possible coverage -#define INFINITE_COVERAGE 100000000 - -char buf[MOST], chr[100], old_chr[100]; - -// column for the relevant individuals/groups -int col[MOST], *T; -int nI, lo, hi, min_space, min_geno, chr_col, pos, old_pos, gd_snp, X[MOST]; - -void get_X() { - char *p, *z = "\t\n", trash[MOST]; - int i; - - strcpy(trash, buf); - // set X[i] = atoi(i-th word of s), i is base 0 - for (i = 1, p = strtok(trash, z); p != NULL; - ++i, p = strtok(NULL, z)) { - if (chr_col && i == chr_col) - strcpy(chr, p); - X[i] = atoi(p); - if (chr_col && i == chr_col+1) - pos = X[i]; - } -} - -int compar(int *a, int *b) { - if (*a < *b) - return -1; - if (*a > *b) - return 1; - return 0; -} - -void find_lo(char *filename) { - FILE *fp = ckopen(filename, "r"); - int n, m, i, k; - - for (n = 0; fgets(buf, MOST, fp); ++n) - ; - T = ckalloc(n*sizeof(int)); - rewind(fp); - for (k = 0; fgets(buf, MOST, fp); ++k) { - get_X(); - for (i = T[k] = 0; i < nI; ++i) { - m = col[i]; - T[k] += (X[m]+X[m+1]); - } - } - qsort((void *)T, (size_t)n, sizeof(int), (const void *)compar); - if (lo < 0) { - lo = -lo; - if (lo > 100) - fatal("cannot have low > 100%"); - lo = T[(n*lo)/100]; - } - if (hi < 0) { - hi = -hi; - if (hi > 100) - fatal("cannot have high > 100%"); - hi = T[(n*hi)/100]; - } - free(T); - fclose(fp); -} - -void OK() { - if (chr_col == 0 || !same_string(chr, old_chr) || - pos >= old_pos + min_space) { - printf("%s", buf); - if (chr_col) { - strcpy(old_chr, chr); - old_pos = pos; - } - } -} -int main(int argc, char **argv) { - FILE *fp; - int m, i, cov, tot_cov, indiv, qual, ngeno; - - if (argc < 10) - fatalf("args: SNP-table gd_snp low-tot high-tot low-cov low-qual low-genotype low-space chr-col col1 col2 ..."); - - for (i = 10, nI = 0; i < argc; ++i, ++nI) - col[nI] = atoi(argv[i]); - gd_snp = atoi(argv[2]); - lo = atoi(argv[3]); - hi = atoi(argv[4]); - if (hi == 0) - hi = INFINITE_COVERAGE; - if (lo < 0 || hi < 0) - find_lo(argv[1]); - indiv = atoi(argv[5]); - qual = atoi(argv[6]); - min_geno = atoi(argv[7]); - min_space = atoi(argv[8]); - chr_col = atoi(argv[9]); - - // reality checks - if (!gd_snp && - (lo != 0 || hi != INFINITE_COVERAGE || indiv != 0 || qual != 0)) - fatal("cannot bound coverage or quality in gd_genotype file"); - if (chr_col == 0 && min_space != 0) - fatalf("Specification of minimum spacing requires a reference sequence"); - if (indiv < 0 || qual < 0) - fatalf("percentiles not implemented for individuals"); - - // scan the SNPs - fp = ckopen(argv[1], "r"); - while (fgets(buf, MOST, fp)) { - if (buf[0] == '#') - continue; - get_X(); - for (i = tot_cov = ngeno = 0; i < nI; ++i) { - m = col[i]; - if (gd_snp) { - cov = (X[m]+X[m+1]); - if (cov < indiv || X[m+3] < qual) - break; - tot_cov += cov; - } - if (X[m+2] != -1) - ++ngeno; - } - if (i < nI) // check bounds on the population's individuals - continue; - if (gd_snp && (tot_cov < lo || tot_cov > hi)) - continue; - if (ngeno >= min_geno) - // OK, except possibly for lower bound on spacing - OK(); - } - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/get_pi.c --- a/genome_diversity/src/get_pi.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
b'@@ -1,282 +0,0 @@\n-/* get_pi -- compute piNon, piSyn, thetaNon and thetaSyn\n-*\n-* argv[1] -- SAPs file\n-* argv[2] -- SNPs file\n-* argv[3] -- covered intervals file\n-* argv[4], argv[5], ... -- starting columns in the SNP file for the chosen\n-* individuals\n-*\n-* We assume that lines of argv[1], argv[2] and argv[3] start with a scaffold\n-* name and a scaffold position, and that they are sorted on those two fields.\n-* The 4th entry in an interval line gives the reference chromosome. We ignore\n-* unnumbered chromosome, e.g., "chrX".\n-*\n-* Output:\n-* the number of nonsyn SNPs, total number of nonsynon sites, piNon,\n-* the number of synon SNPs, total number of synon sites, piSyn, plus\n-* total length of covered intervals, thetaNon, thetaSyn\n-*\n-* What it does on Galaxy\n-The tool computes values that estimate some basic parameters\n-*/ \n-\n-#include "lib.h"\n-\n-// END_FILE should be larger than any real scaffold number\n-#define END_FILE 1000000000 // scaffold "number" signifying end-of-file\n-#define BUF_SIZE 50000\t// maximum length of a SNP-file row\n-\n-int col[10000], nC; // columns containing the chosen genotypes\n-\n-// scaffold numbers and positions of the current SAP, SNP, and interval\n-int nbr_SAP, nbr_SNP, nbr_interv, pos_SAP, pos_SNP, beg, end, columns, debug;\n-\n-// current SNP row, the variant amino acids of the SAP, interval\'s reference chr\n-char snp[BUF_SIZE], A[100], B[100], chr[100];\n-\n-// number of nonsynon and snynon sites in the current interval\n-float all_non, all_syn;\n-\n-// return the number of chromosome pairs that differ at a SNP\n-int diff_pair() {\n-\tint i, j, X[1000];\n-\tchar *p, *z = "\\t\\n";\n-\n-\t// set X[i] = atoi(i-th word of SNP-file row), base 1\n-\tfor (i = 1, p = strtok(snp, z); p != NULL;\n-\t ++i, p = strtok(NULL, z))\n-\t\tX[i] = atoi(p);\n-\t// add genotypes to count the reference allele\n-\tfor (j = i = 0; i < nC; ++i)\n-\t\tj += X[col[i]];\n-\t// of the 2*nC chromosomes, j have the reference nucleotide\n-\tif (debug)\n-\t\tprintf("get_pi: j = %d, return %d\\n", j, j*(2*nC-j));\n-\treturn j*(2*nC-j);\n-}\n-\n-// translate e.g. "scaffold234" to the integer 234\n-int name2nbr(char *s) {\n-\tif (same_string(s, "chrX"))\n-\t\treturn 1000;\n-\tif (same_string(s, "chrY"))\n-\t\treturn 1001;\n-\twhile (isalpha(*s))\n-\t\t++s;\n-\treturn atoi(s);\n-}\n-\n-// does one scaffold-position pair strictly precede another\n-int before(int nbra, int posa, int nbrb, int posb) {\n-\treturn (nbra < nbrb || (nbra == nbrb && posa < posb));\n-}\n-\n-// get a SAP; set A and B; set nbr_SAP = END_FILE for end-of-file\n-void get_SAP(FILE *fp) {\n-\tchar buf[500], scaf_name[100];\n-\tint old_nbr = nbr_SAP, old_pos = pos_SAP;\n-\n-\tif (nbr_SAP >= END_FILE)\n-\t\treturn;\n-\tif (!fgets(buf, 500, fp)) {\n-\t\tnbr_SAP = END_FILE;\n-\t\treturn;\n-\t}\n-\twhile (buf[0] == \'#\')\n-\t\tif (!fgets(buf, 500, fp)) {\n-\t\t\tnbr_SAP = END_FILE;\n-\t\t\treturn;\n-\t\t}\n-\tif (columns == 8) {\n-\t\tif (sscanf(buf, "%s %d %*s %*s %*s %*s %s %*s %s",\n-\t\t scaf_name, &pos_SAP, A, B) != 4)\n-\t\t\tfatalf("bad SAP : %s", buf);\n-\t} else if (columns == 5) {\n-\t\tif (sscanf(buf, "%s %d %*s %*s %s %*s %s",\n-\t\t scaf_name, &pos_SAP, A, B) != 4)\n-\t\t\tfatalf("bad SAP : %s", buf);\n-\t} else\n-\t\tfatalf("get_SAP: columns = %d", columns);\n-\tnbr_SAP = name2nbr(scaf_name);\n-\tif (before(nbr_SAP, pos_SAP, old_nbr, old_pos))\n-\t\tfatalf("SAP at scaf%d %d is out of order", nbr_SAP, pos_SAP);\n-\tif (debug)\n-\t\tprintf("SAP: scaf%d %d\\n", nbr_SAP, pos_SAP);\n-}\n-\n-// get a SNP\n-void get_SNP(FILE *fp) {\n-\tchar scaf_name[100];\n-\tint old_nbr = nbr_SNP, old_pos = pos_SNP;\n-\n-\tif (nbr_SNP >= END_FILE)\n-\t\treturn;\n-\tif (!fgets(snp, BUF_SIZE, fp)) {\n-\t\tnbr_SNP = END_FILE+1;\n-\t\treturn;\n-\t}\n-\twhile (snp[0] == \'#\')\n-\t\tif (!fgets(snp, 500, fp)) {\n-\t\t\tnbr_SNP = END_FILE+1;\n-\t\t\treturn;\n-\t\t}\n-\tif (sscanf(snp, "%s %d", scaf_name, &pos_SNP) != 2)\n-\t\tfatalf("bad SNP : %s", snp);\n-\tnbr_SNP = name2nbr(scaf_name);\n-\tif (before(nbr_SNP, pos_SNP, old_nbr, old_pos)) {\n-\t\tfprintf(stderr, "seq%d %d before seq%d %d\\n",\n-\t\t nbr_SNP, pos_SNP, old_nbr, old_pos);\n-\t\tfatalf("SNP at se'..b'.1 < remain)\n-\t\td += third;\n-\tif (0.5 < remain)\n-\t\td += third;\n-\treturn d;\n-}\n-\n-// read an interval; update tot_non and tot_syn\n-int get_interval(FILE *fp) {\n-\tchar buf[500], scaf_name[100], tmp[500], *t, *z = " \\t\\n";\n-\tint old_nbr = nbr_interv, old_end = end;\n-\n-\tif (!fgets(buf, 500, fp))\n-\t\treturn 0;\n-\twhile (buf[0] == \'#\')\n-\t\tif (!fgets(buf, 500, fp))\n-\t\t\treturn 0;\n-\tif (columns == 0) {\n-\t\tstrcpy(tmp, buf);\n-\t\tfor (columns = 0, t = strtok(tmp, z); t != NULL;\n-\t\t ++columns, t = strtok(NULL, z))\n-\t\t\t;\n-\t} \n-\tif (columns != 5 && columns != 8)\n-\t\tfatalf("interval table has %d columns", columns);\n-\tif (columns == 8 && sscanf(buf, "%s %d %d %s %*s %*s %f %f",\n-\t scaf_name, &beg, &end, chr, &all_non, &all_syn) != 6)\n-\t\tfatalf("bad interval : %s", buf);\n-\tif (columns == 5) {\n-\t\tif (sscanf(buf, "%s %d %d %f %f",\n-\t\t scaf_name, &beg, &end, &all_non, &all_syn) != 5)\n-\t\t\tfatalf("bad interval : %s", buf);\n-\t\tstrcpy(chr, scaf_name);\n-\t}\n-\tnbr_interv = name2nbr(scaf_name);\n-\tif (before(nbr_interv, beg, old_nbr, old_end))\n-\t\tfatalf("interval at %s %d is out of order", scaf_name, beg);\n-\tif (debug)\n-\t\tprintf("int: scaf%d %d-%d\\n", nbr_interv, beg, end);\n-\t\t\n-\treturn 1;\n-}\n-\n-int main(int argc, char **argv) {\n-\tFILE *fp1, *fp2, *fp3;\n-\tint i, nint, nsap, no_sap, no_snp, no_chr, nsyn, nnon, d, tot_len;\n-\tfloat non, syn, x;\n-\tdouble tot_non = 0.0, tot_syn = 0.0,\t// total sites in the intervals\n-\t factor;\n-\n-\t// process command-line arguments\n-\tif (same_string(argv[argc-1], "debug")) {\n-\t\tdebug = 1;\n-\t\t--argc;\n-\t}\n-\tif (argc < 5)\n-\t\tfatal("args: SAPs SNPs intervals individual1 ... [debug]");\n-\tfp1 = ckopen(argv[1], "r");\n-\tfp2 = ckopen(argv[2], "r");\n-\tfp3 = ckopen(argv[3], "r");\n-\tfor (i = 4; i < argc; ++i)\n-\t\tcol[i-4] = atoi(argv[i]) + 2;\n-\tnC = argc - 4;\n-\n-\t// loop over the intervals\n-\ttot_len = no_sap = nsap = no_snp = no_chr = nsyn = nnon = 0;\n-\tnon = syn = 0.0;\n-\tfor (nint = 0; get_interval(fp3); ++nint) {\n-\t\tif (strncmp(chr, "chr", 3) == 0 && !isdigit(chr[3])) {\n-\t\t\t++no_chr;\n-\t\t\tcontinue;\n-\t\t}\n-\t\ttot_len += (end - beg);\n-\t\t// expand e.g. .333 to .3333333..\n-\t\ttot_non += grow(all_non);\n-\t\ttot_syn += grow(all_syn);\n-\n-\t\t// skip SAPs coming before this interval\n-\t\twhile (before(nbr_SAP, pos_SAP, nbr_interv, beg))\n-\t\t\tget_SAP(fp1);\n-\t\t// loop over SAPs in this inteval\n-\t\twhile (before(nbr_SAP, pos_SAP, nbr_interv, end)) {\n-\t\t\t++nsap;\n-\n-\t\t\t// look for this SAP in the SNP file\n-\t\t\twhile (before(nbr_SNP, pos_SNP, nbr_SAP, pos_SAP)) {\n-\t\t\t\tif (nbr_SNP == nbr_interv && pos_SNP >= beg)\n-\t\t\t\t\t++no_sap;\n-\t\t\t\tget_SNP(fp2);\n-\t\t\t}\n-\n-\t\t\t// is this the SAP we looked for?\n-\t\t\tif (nbr_SAP == nbr_SNP && pos_SAP == pos_SNP) {\n-\t\t\t\td = diff_pair();\n-\t\t\t\tif (A[0] == B[0]) {\n-\t\t\t\t\t++nsyn;\n-\t\t\t\t\tsyn += (float)d;\n-\t\t\t\t} else {\n-\t\t\t\t\t++nnon;\n-\t\t\t\t\tnon += (float)d;\n-\t\t\t\t}\n-\t\t\t\tget_SNP(fp2);\n-\t\t\t} else\n-\t\t\t\t++no_snp;\n-\t\t\tget_SAP(fp1);\n-\t\t}\n-\t\t// process SNPs in the interval but not in the SAP file\n-\t\twhile (before(nbr_SNP, pos_SNP, nbr_interv, end)) {\n-\t\t\tif (nbr_SNP == nbr_interv && pos_SNP >= beg)\n-\t\t\t\t++no_sap;\n-\t\t\tget_SNP(fp2);\n-\t\t}\n-\t}\n-\n-\t// there are x = (2*nC choose 2) pairs of chromosomes\n-\tx = (float)(nC*(2*nC-1));\n-\tnon /= x;\n-\tsyn /= x;\n-\tprintf("%d intervals\\n", nint);\n-\tif (no_chr)\n-\t\tprintf("ignored %d interval%s on unnumbered chromosomes, like chrX\\n",\n-\t\t no_chr, no_chr == 1 ? "" : "s");\n-\tprintf("%d SNPs, %d nonsyn, %d synon\\n", nsap, nnon, nsyn);\n-\tif (no_sap)\n-\t\tprintf("%d SNPs in an interval are not in the SAP table\\n",\n-\t\t no_sap);\n-\tif (no_snp)\n-\t\tprintf("%d SNPs in an interval are not in the SNP table\\n",\n-\t\t no_snp);\n-\tprintf("nonsynon: %4.3f/%4.3f = %6.5f%%\\n",\n-\t non, tot_non, 100.0*non/tot_non);\n-\tprintf("synon: %4.3f/%4.3f = %6.5f%%\\n",\n-\t syn, tot_syn, 100.0*syn/tot_syn);\n-\tfor (factor = 0.0, i = 1; i < 2*nC; ++i)\n-\t\tfactor += (1.0/(double)i);\n-\tfactor *= (double)tot_len/100.0;\n-\tprintf("%d covered bp, thetaNon = %6.5f%%, thetaSyn = %6.5f%%\\n",\n-\t tot_len, (double)nnon/factor, (double)nsyn/factor);\n-\treturn 0;\n-}\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/lib.c --- a/genome_diversity/src/lib.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,71 +0,0 @@ -// lib.c -- a little library of C procudures - -#include "lib.h" - -char *argv0; - -/* print_argv0 ---------------------------------------- print name of program */ -void print_argv0(void) -{ - if (argv0) { - char *p = strrchr(argv0, '/'); - (void)fprintf(stderr, "%s: ", p ? p+1 : argv0); - } -} - -/* fatal ---------------------------------------------- print message and die */ -void fatal(const char *msg) -{ - fatalf("%s", msg); -} - -/* fatalf --------------------------------- format message, print it, and die */ -void fatalf(const char *fmt, ...) -{ - va_list ap; - va_start(ap, fmt); - fflush(stdout); - print_argv0(); - (void)vfprintf(stderr, fmt, ap); - (void)fputc('\n', stderr); - va_end(ap); - exit(1); -} - -/* ckopen -------------------------------------- open file; check for success */ -FILE *ckopen(const char *name, const char *mode) -{ - FILE *fp; - - if ((fp = fopen(name, mode)) == NULL) - fatalf("Cannot open %s.", name); - return fp; -} - -/* ckalloc -------------------------------- allocate space; check for success */ -void *ckalloc(size_t amount) -{ - void *p; - - if ((long)amount < 0) /* was "<= 0" -CR */ - fatal("ckalloc: request for negative space."); - if (amount == 0) - amount = 1; /* ANSI portability hack */ - if ((p = malloc(amount)) == NULL) - fatalf("Ran out of memory trying to allocate %lu.", - (unsigned long)amount); - return p; -} - -/* same_string ------------------ determine whether two strings are identical */ -bool same_string(const char *s, const char *t) -{ - return (strcmp(s, t) == 0); -} - -/* copy_string ---------------------- save string s somewhere; return address */ -char *copy_string(const char *s) -{ - char *p = ckalloc(strlen(s)+1); /* +1 to hold '\0' */ - return strcpy(p, s); -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/lib.h --- a/genome_diversity/src/lib.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
b |
@@ -1,33 +0,0 @@ -// lib.h -- header file for some useful procedures - -#include <stdio.h> -#include <stdlib.h> -#include <string.h> -#include <ctype.h> -#include <limits.h> /* INT_MAX, INT_MIN, LONG_MAX, LONG_MIN, etc. */ -#include <stdarg.h> - -typedef unsigned char uchar; -typedef int bool; - -extern char *argv0; - -void print_argv0(void); -#ifdef __GNUC__ /* avoid some "foo might be used uninitialized" warnings */ - void fatal(const char *msg) __attribute__ ((noreturn)); - void fatalf(const char *fmt, ...) __attribute__ ((noreturn)); - void fatalfr(const char *fmt, ...) __attribute__ ((noreturn)); -#else - void fatal(const char *msg); - void fatalf(const char *fmt, ...); - void fatalfr(const char *fmt, ...); -#endif -FILE *ckopen(const char *name, const char *mode); -void *ckalloc(size_t amount); -bool same_string(const char *s, const char *t); -char *copy_string(const char *s); - -#undef MAX -#define MAX(x,y) ((x) > (y) ? (x) : (y)) -#undef MIN -#define MIN(x,y) ((x) < (y) ? (x) : (y)) |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/mito_lib.c --- a/genome_diversity/src/mito_lib.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,98 +0,0 @@ -// mito_data.c -- shared procedures to read SNP and coverage file for -// mitogenomes - -#include "lib.h" -#include "mito_lib.h" - -#define ADHOC - -// get the adequately covered intervals for each specified individual; -// merge adjacent intervals -void get_intervals(char *filename) { - FILE *fp = ckopen(filename, "r"); - char buf[500], name[100]; - struct interv *p, *new; - int i, b, e, cov; - - while (fgets(buf, 500, fp)) { - if (sscanf(buf, "%*s %d %d %d %s", &b, &e, &cov, name) != 4) - fatalf("interval: %s", buf); - if (cov < min_cov) - continue; - for (i = 0; i < nM && !same_string(M[i].name, name); ++i) - ; - if (i == nM) - continue; - if (M[i].last != NULL && M[i].last->e == b) { - // merge with adjacent interval - M[i].last->e = e; - continue; - } - new = ckalloc(sizeof(*new)); - new->b = b; - new->e = e; - new->next = NULL; - if ((p = M[i].last) == NULL) - M[i].intervals = new; - else - p->next = new; - M[i].last = new; - } - fclose(fp); -/* - for (i = 0; i < nM; ++i) { - printf("%s:", M[i].name); - for (p = M[i].intervals; p; p = p->next) - printf(" %d-%d", p->b, p->e); - putchar('\n'); - } -*/ -} - -// get the SNPs; for each SNP set the array of (first characters from) -// genotypes of the specified samples (individuals) -int get_variants(char *filename, struct snp *S, int refcol) { - FILE *fp = ckopen(filename, "r"); - char buf[5000], *s, *f[101], *z = " \t\n\0"; - int i, n, c; - - for (i = 0; i <= 100; ++i) - f[i] = NULL; - for (n = 0; fgets(buf, 500, fp); ++n) { - if (buf[0] == '#') { - --n; - continue; - } - if (n >= MAX_SNP) - fatal("too many SNPs"); - if (sscanf(buf, "%*s %d", &(S[n].pos)) != 1) - fatalf("pos : %s", buf); - S[n].g = ckalloc((nM+1)*(sizeof(char))); - S[n].g[nM] = '\0'; - for (i = 0; i <= 100; ++i) - if (f[i] != NULL) - free(f[i]); - for (i = 1, s = strtok(buf, z); s; s = strtok(NULL, z), ++i) - f[i] = copy_string(s); - for (i = 0; i < nM; ++i) { - // genotype is 2 columns past the individual's 1st - // column - // AD HOC RULE: IF THERE IS ONE READ OF EACH ALLELE, - // THEN IGNORE THE SNP. - c = M[i].col; - if (refcol == 0) - S[n].g[i] = f[c+2][0]; - else if (same_string(f[refcol+2], f[c+2])) - S[n].g[i] = '2'; - else - S[n].g[i] = '0'; -#ifdef ADHOC - if (same_string(f[c], "1") && - same_string(f[c+1], "1")) - S[n].g[i] = '-'; -#endif - } - } - fclose(fp); - return n; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/mito_lib.h --- a/genome_diversity/src/mito_lib.h Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,31 +0,0 @@ -// mito_data.h -- header file for shared procedures to read SNPs and intervals -// for mitogenomes - -#define MAX_SNP 20000 -#define MAX_SAMPLE 100 -struct snp { - int pos; - char *g; // genotypes - one character per specified mitogenome -} S[MAX_SNP], I[MAX_SNP]; - -// intervals associated with each specified mitogenome -struct interv { - int b, e; - struct interv *next; -}; -int nM, min_cov, debug; - -// mitogenomes -struct mito { - char *name; - int col; // first column in the SNP table - struct interv *intervals, *last; -} M[MAX_SAMPLE]; - -// get the adequately covered intervals for each specified individual; -// merge adjacent intervals -void get_intervals(char *filename); - -// get the SNPs; for each SNP set the array of (first characters from) -// genotypes of the specified samples (individuals) -int get_variants(char *filename, struct snp *S, int refcol); |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/mk_Ji.c --- a/genome_diversity/src/mk_Ji.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,147 +0,0 @@ -/* mk_Ji -- prepare data for drawing a picture of mitogenome data -* -* argv[1] -- SNP table for the mitogenome -* -* argv[2] -- indel table for the mitogenome -* -* argv[3] -- coverage table: file of intervals with lines like: - -P.ingens-mt 175 194 6 9-M-352 - -* giving genome name, start postion (base-0), end position (half open), -* coverage and sample name. -* -* argv[4] -- annotation file like - -P.ingens-mt 0 70 tRNA + tRNA-Phe -P.ingens-mt 70 1030 rRNA + 12S -P.ingens-mt 1030 1100 tRNA + tRNA-Val -P.ingens-mt 1101 2680 CDS + rRNA -P.ingens-mt 1101 2680 rRNA + 16S -P.ingens-mt 2680 2755 tRNA + tRNA-Leu -P.ingens-mt 2758 3713 CDS + ND1 -... -P.ingens-mt 15484 16910 D-loop + D-loop - -* argv[5] -- the minimum coverage. Intervals of lower coverage are ignored. -* -* argv[6] -- either the string "default" or the name of an individual -* -* argv[7], argv[8], ... column:name pairs like "9:sam". -* -* Also, if the last argument is "debug", then much output sent to stderr. -*/ - -#include "lib.h" -#include "mito_lib.h" - -int ref_len; - -// read gene annotation, change "CDS" to "gene", print for the drawing tool, -// print lines showing the genome name and length (last annotated position). -void get_genes(char *filename) { - FILE *fp = ckopen(filename, "r"); - char buf[500], ref[100], type[100], name[100], *t; - int b, e; - - while (fgets(buf, 500, fp)) { - if (sscanf(buf, "%s %d %d %s %*s %s", - ref, &b, &e, type, name) != 5) - fatalf("gag: %s", buf); - t = (same_string(type, "CDS") ? "gene" : type); - // print the Genome Annotation line - printf("@GA=%d:%d:%s:%s\n", b, e, name, t); - } - printf("@GL=%d\n", ref_len = e); - printf("@GN=%s\n", ref); -} - -// print items that are adequately covered -void visible(int i, struct snp *S, int nS, char *s) { - struct interv *t; - int j; - - for (j = 0, t = M[i].intervals; j < nS; ++j) { - while (t && t->e <= S[j].pos) - t = t->next; - if (t && t->b <= S[j].pos && S[j].g[i] == '0') - printf(" %s%d", s, S[j].pos); - } -} - -int main(int argc, char **argv) { - struct interv *t; - int i, nS, nI, last_e, refcol; - char *a, *s; - - if (argc > 7 && same_string(argv[argc-1], "debug")) { - --argc; - debug = 1; - } - - if (argc < 7) - fatal("args: snps indels intervals genes min_cov ref 9:sam 13:judy ... "); - - // store sample names and start positions (argv[6], argv[7], ...) - for (nM = 0, i = 7; i < argc; ++nM, ++i) { - if (nM >= MAX_SAMPLE) - fatalf("Too many mitogenomes"); - if ((s = strchr(a = argv[i], ':')) == NULL) - fatalf("colon: %s", a); - M[nM].col = atoi(a); - M[nM].name = copy_string(s+1); - } - if (same_string(argv[6], "default")) - refcol = 0; - else { - for (i = 0; i < nM && !same_string(argv[6], M[i].name); ++i) - ; - if (i == nM) - fatalf("improper reference name '%s'", argv[6]); - refcol = M[i].col; - // fprintf(stderr, "refcol = %d\n", refcol); - } - - // read annotation and annotate in the file for drawing - get_genes(argv[4]); - - // record color information - printf("@CL=rRNA:#EF8A62\n@CL=tRNA:#31A354\n@CL=gene:#B2182B\n"); - printf("@CL=missing:#67A9CF:L\n@CL=indel:#2166AC\n@CL=special:#000000\n"); - - min_cov = atoi(argv[5]); - - // store the coverage data - get_intervals(argv[3]); - - if (debug) { - for (i = 0; i < nM; ++i) { - fprintf(stderr, ">%s\n", M[i].name); - for (t = M[i].intervals; t; t = t->next) - fprintf(stderr, "%d\t%d\n", t->b, t->e); - } - putc('\n', stderr); - } - - // record the variants - nS = get_variants(argv[1], S, refcol); - nI = get_variants(argv[2], I, refcol); - - // report the information for each sample - for (i = 0; i < nM; ++i) { - printf("%s", M[i].name); - visible(i, S, nS, ""); - visible(i, I, nI, "indel="); - last_e = 0; - for (t = M[i].intervals; t; t = t->next) { - if (last_e < t->b) - printf(" missing=%d:%d", last_e, t->b); - last_e = t->e; - } - if (last_e < ref_len) - printf(" missing=%d:%d", last_e, ref_len); - putchar('\n'); - } - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/mt_pi.c --- a/genome_diversity/src/mt_pi.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
@@ -1,164 +0,0 @@ -/* mt_pi -- compute the diversity measure pi for mitochondrial genomes -* [SHOULD I OPTIONALLY INCLUDE INDELS?] -* -* argv[1] -- SNP table for the mitogenome -* -* argv[2] -- file of intervals with lines like: - -P.ingens-mt 175 194 6 9-M-352 - -* giving genome name, start postion (base-0), end position (half open), -* coverage and sample name. -* -* argv[3] -- the minimum coverage. Intervals of lower coverage are ignored. -* -* argv[4], argv[5], ... column:name pairs like "9:sam". -* -* Also, if the last argument is "debug", then much output sent to stderr, if it -* is "debug2", then the coverage and difference-count for each mitogenome-pair -* is sent to stderr. -*/ - -#include "lib.h" -#include "mito_lib.h" - -int debug2; - -// for a pair of samples, determine how much of the reference is in the -// adequately covered segments for both, and count the number of SNPs in those -// shared regions where they differ -// PUTATIVE HETEROPLASMIES ARE IGNORED -float pair(int i, int j, int nS) { - int covered, B, E, diffs, k; - struct interv *p = M[i].intervals, *q = M[j].intervals; - char x, y; - - // k scans the SNPs - covered = diffs = k = 0; - while (p && q) { - if (debug) - fprintf(stderr, "trying %d-%d and %d-%d\n", - p->b, p->e, q->b, q->e); - // take the intersection of the two well-covered intervals - B = MAX(p->b, q->b); - E = MIN(p->e, q->e); - if (B < E) { - if (debug) - fprintf(stderr, " covered %d\n", E-B); - covered += (E - B); - for ( ; k < nS && S[k].pos < E; ++k) { - if (S[k].pos >= B) { - x = S[k].g[i]; - y = S[k].g[j]; - if (debug) - fprintf(stderr, - " SNP %c %c at %d\n", - x, y, S[k].pos); -/* - if (x == '-' || y == '-') - fatalf("SNP at %d missing geno", - S[k].pos); -*/ -/* - if (x == '1' || y == '1') - continue; -*/ - if (x != y) { - ++diffs; - if (debug) - fprintf(stderr, - "\tdiff at %d\n", - S[k].pos); - } - } - } - } - // go to next adequately covered interval(s) - if (p->e < q->e) - p = p->next; - else if (p->e > q->e) - q = q->next; - else { - p = p->next; - q = q->next; - } - } - if (debug2) - fprintf(stderr, "cov(%s,%s) = %d, diffs = %d\n", - M[i].name, M[j].name, covered, diffs); -/* - if (covered == 0) - fatalf("coverage threshold is too high for %s and %s", - M/[i].name, M[j].name); -*/ - if (covered == 0) - return -1.0; - return (float)diffs/(float)covered; -} - -int main(int argc, char **argv) { - struct interv *t; - int i, j, nS, good_pairs, bad_pairs; - char *a, *s; - float tot, pr; - - if (argc > 4 && same_string(argv[argc-1], "debug")) { - --argc; - debug = debug2 = 1; - } else if (argc > 4 && same_string(argv[argc-1], "debug2")) { - --argc; - debug2 = 1; - } - - if (argc < 5) - fatal("args: snps intervals min_cov 9:sam 13:judy ... "); - // store sample names and start positions (argv[4], argv[5], ...) - for (nM = 0, i = 4; i < argc; ++nM, ++i) { - if (nM >= MAX_SAMPLE) - fatalf("Too many mitogenomes"); - if ((s = strchr(a = argv[i], ':')) == NULL) - fatalf("colon: %s", a); - M[nM].col = atoi(a); - M[nM].name = copy_string(s+1); - } - min_cov = atoi(argv[3]); - get_intervals(argv[2]); - - if (debug) { - for (i = 0; i < nM; ++i) { - fprintf(stderr, ">%s\n", M[i].name); - for (t = M[i].intervals; t; t = t->next) - fprintf(stderr, "%d\t%d\n", t->b, t->e); - } - putc('\n', stderr); - } - - // record the SNPs - nS = get_variants(argv[1], S, 0); - - if (debug) { - for (i = 0; i < nS; ++i) - fprintf(stderr, "%d %s\n", S[i].pos, S[i].g); - putc('\n', stderr); - } - - // record the total rate of diversity, over all pairs of individuals - // having overlapping well-covered intervals - good_pairs = bad_pairs = 0; - for (i = 0, tot = 0.0; i < nM; ++i) { - for (j = i+1; j < nM; ++j) { - pr = pair(i, j, nS); - if (pr >= 0.0) { - ++good_pairs; - tot += pr; - } else - ++bad_pairs; - } - } - printf("pi = %5.5f\n", tot/(float)good_pairs); - if (bad_pairs > 0) - printf("%d of %d pairs had no sequenced regions in common\n", - bad_pairs, bad_pairs + good_pairs); - - return 0; -} |
b |
diff -r 4188853b940b -r a631c2f6d913 genome_diversity/src/sweep.c --- a/genome_diversity/src/sweep.c Fri Jul 26 12:51:13 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 |
[ |
b'@@ -1,302 +0,0 @@\n-/* sweep -- find regions of the genome with high scores (e.g., Fst scores).\n-*\n-* argv[1] -- file containing a Galaxy table\n-* argv[2] -- column number for the chromosome name (column numbers base-1)\n-* argv[3] -- column number for the chromosomal position\n-* argv[4] -- column number for a score for the position\n-* argv[5] -- a percentage, such as "95", or a raw score, such as "=0.9".\n-* argv[6] -- the number of randomizations (shuffles) of the scores\n-* argv[7] -- [optional] if present and non-zero, report SNPs\n-*\n-* The program first determines a threshold such that the stated percentage\n-* of the scores are below that threshold (or uses the provided number if\n-* argv[5] starts with "="). The program subtracts the threshold\n-* from each score, then looks for maximal-scoring runs of SNPs, i.e., where\n-* adding or subtracting SNPs from an end of then run always decreases the\n-* total score. These regions are printed in order of descreasing total score.\n-* To determine a cutoff for the printed regions, the programs takes the maximum\n-* score over all regions observed in a specified number of shuffles of the\n-* list of scores. If argv[6] = 0, then all maximal-scoring runs of at least\n-* 4 table entries are printed.\n-\n-What it does on Galaxy\n-The user selects a SNP table and specifies the columns containing (1) chromosome, (2) position, (3) scores (such as an Fst-value for the SNP), (4) a percentage or raw score for the "cutoff" and (5) the number of times the data should be radomized (only intervals with score exceeding the maximum for the randomized data are reported). If a percentage (e.g. 95%) is specified for #3, then that percentile of the scores is used as the cutoff; this may not work well if many SNPs have the same score. The program subtracts the cutoff from every score, then finds genomic intervals (i.e., consecutive runs of SNPs) whose total score cannot be increased by adding or subtracting one or more SNPs at the ends of the interval.\n-*/\n-\n-#include "lib.h"\n-#include "Huang.h"\n-\n-// maximum number of rows in any processed table\n-#define MANY 20000000\n-#define BUF_SIZE 50000\n-#define MAX_WINDOW 1000000\n-\n-double X[MANY];\t// holds all scores\n-int nX;\n-\n-// position-score pairs for a single chromosome\n-struct score {\n-\tint pos;\n-\tdouble x; // original score, then shifted score\n-} S[MANY];\n-int nS;\n-\n-struct snp {\n-\tint pos;\n-\tdouble x;\n-\tstruct snp *next;\n-};\n-\n-// structure to hold the maximum-scoring chromosomal intervals\n-struct sweep {\n-\tfloat score;\n-\tchar *chr;\n-\tint b, e;\n-\tstruct snp *snps;\n-} W[MAX_WINDOW];\n-int nW;\n-\n-#define MAX_CHR_NAME 1000000\n-char *chr_name[MAX_CHR_NAME];\n-int nchr_name;\n-\n-// return the linked list of SNPs in positions b to e\n-struct snp *add_snps(int b, int e) {\n-\tstruct snp *first = NULL, *last = NULL, *new;\n-\tint i;\n-\tfor (i = b; i <= e; ++i)\n-\t\tif (S[i].pos >= 0) {\n-\t\t\tnew = ckalloc(sizeof(*new));\n-\t\t\tnew->pos = S[i].pos;\n-\t\t\tnew->x = S[i].x;\n-\t\t\tnew->next = NULL;\n-\t\t\tif (first == NULL)\n-\t\t\t\tfirst = new;\n-\t\t\telse\n-\t\t\t\tlast->next = new;\n-\t\t\tlast = new;\n-\t\t}\n-\treturn first;\n-}\n-\n-// given a table row, return a pointer to the item in a particular column\n-char *get_col(char *buf, int col) {\n-\tstatic char temp[BUF_SIZE], *p;\n-\tint i;\n-\tchar *z = " \\t\\n";\n-\n-\tstrcpy(temp, buf);\n-\tfor (p = strtok(temp, z), i = 1; *p && i < col;\n-\t p = strtok(NULL, z), ++i)\n-\t\t;\n-\tif (p == NULL)\n-\t\tfatalf("no column %d in %s", col, buf);\n-\treturn p;\n-}\n-\n-// fill S[] with position-score pairs for the next chromosome\n-// return 0 for EOF\n-int get_chr(FILE *fp, int chr_col, int pos_col, int score_col, char *chr) {\n-\tstatic char buf[BUF_SIZE];\n-\tstatic int init = 1;\n-\tint old_pos = 0, p, i;\n-\tchar *status, *s;\n-\n-\tif (init) {\n-\t\twhile ((status = fgets(buf, BUF_SIZE, fp)) != NULL &&\n-\t\t buf[0] == \'#\')\n-\t\t\t;\n-\t\tif (status == NULL)\n-\t\t\tfatal("empty table");\n-\t\tinit = 0;\n-\t}\n-\tif (buf[0] == \'\\0\')\n-\t\treturn 0;\n-\t\n-\tif (buf[0] == \'#\')\n-\t\tfatal("cannot happen");\n-\tstrcpy(ch'..b'"Too many chromosome names");\n-\tfor (i = 0; i < nchr_name && !same_string(s, chr_name[i]); ++i)\n-\t\t;\n-\tif (i < nchr_name)\n-\t\tfatalf("SNVs on %s aren\'t together", s);\n-\tchr_name[nchr_name++] = copy_string(s);\n-\n-\treturn 1;\n-}\n-\n-// for sorting genomic intervals by *decreasing* score\n-int Wcompar(struct sweep *a, struct sweep *b) {\n-\tfloat y = a->score, z = b->score;\n-\n-\tif (y > z)\n-\t\treturn -1;\n-\tif (y < z)\n-\t\treturn 1;\n-\treturn 0;\n-}\n-\n-// for sorting an array of scores into increasing order\n-int fcompar(double *a, double *b) {\n-\tif (*a < *b)\n-\t\treturn -1;\n-\tif (*a > *b)\n-\t\treturn 1;\n-\treturn 0;\n-}\n-\n-/* shuffle the values S[0], S[1], ... , S[nscores-1];\n-* Uses Algorithm P in page 125 of "The Art of Computer Programming (Vol II)\n-* Seminumerical Programming", by Donald Knuth, Addison-Wesley, 1971.\n-*/\n-void shuffle_scores() {\n-\tint i, j;\n-\tdouble temp;\n-\n-\tfor (i = nX-1; i > 0; --i) {\n-\t\t// swap what\'s in location i with location j, where 0 <= j <= i\n-\t\tj = random() % (i+1);\n-\t\ttemp = X[i];\n-\t\tX[i] = X[j];\n-\t\tX[j] = temp;\n-\t}\n-}\n-\n-// return the best interval score (R[i] is the struct operated by Huang())\n-double best() {\n-\tint i;\n-\tdouble bestScore;\n-\n-\tHuang(X, nX);\n-\n-\tfor (bestScore = 0.0, i = 1; i <= top; ++i) \n-\t\tbestScore = MAX(R[i].Score, bestScore);\n-\treturn bestScore;\n-}\n-\n-int main(int argc, char **argv) {\n-\tFILE *fp;\n-\tchar buf[BUF_SIZE], chr[100], *a;\n-\tdouble shift = 0.0, cutoff;\n-\tint i, b, e, chr_col, pos_col, score_col, nshuffle, snps = 0;\n-\tstruct snp *s;\n-\n-\tif (argc != 7 && argc != 8)\n-\t\tfatal("args: table chr_col pos_col score_col threshold randomizations [SNPs]");\n-\n-\t// process command-line arguments\n-\tchr_col = atoi(argv[2]);\n-\tpos_col = atoi(argv[3]);\n-\tscore_col = atoi(argv[4]);\n-\ta = argv[5];\n-\tfp = ckopen(argv[1], "r");\n-\tif (argc == 8)\n-\t\tsnps = atoi(argv[7]);\n-\tif (isdigit(a[0])) {\n-\t\tfor (nX = 0; nX < MANY && fgets(buf, BUF_SIZE, fp); ) {\n-\t\t\tif (buf[0] == \'#\') \n-\t\t\t\tcontinue;\n-\t\t\tX[nX++] = atof(get_col(buf, score_col));\n-\t\t}\n-\t\tif (nX == MANY)\n-\t\t\tfatal("Too many rows");\n-\t\tqsort((void *)X, (size_t)nX, sizeof(double),\n-\t\t (const void *)fcompar);\n-\t\tshift = X[atoi(a)*nX/100];\n-\t\trewind(fp);\n-\t} else if (a[0] == \'=\')\n-\t\tshift = atof(a+1);\n-\n-//fprintf(stderr, "shift = %4.3f\\n", shift);\n-\tnshuffle = atoi(argv[6]);\n-\tif (nshuffle == 0)\n-\t\tcutoff = 0;\n-\telse {\n-\t\tfor (nX = 0; nX < MANY && fgets(buf, BUF_SIZE, fp); ) { \n-\t\t\tif (buf[0] == \'#\')\n-\t\t\t\tcontinue;\n-\t\t\tX[nX++] = atof(get_col(buf, score_col)) - shift;\n-\t\t}\n-\t\tif (nX == MANY)\n-\t\t\tfatal("Too many rows");\n-\t\tfor (cutoff = 0.0, i = 0; i < nshuffle; ++i) {\n-\t\t\tshuffle_scores();\n-\t\t\tcutoff = MAX(cutoff, best());\n-\t\t}\n-\t\trewind(fp);\n-\t}\n-//fprintf(stderr, "cutoff = %4.3f\\n", cutoff);\n-\n-\t// loop over chromosomes;\n-\t// start by getting the chromosome\'s scores\n-\twhile (get_chr(fp, chr_col, pos_col, score_col, chr)) {\n-\t\t// subtract shift from the scores\n-\t\tfor (i = 0; i < nS; ++i)\n-\t\t\tX[i] = S[i].x - shift;\n-\n-\t\t// find the maximum=scoring regions\n-\t\tHuang(X, nS);\n-\t\n-\t\t// save any regions with >= 4 points and score >= cutoff\n-\t\tfor (i = 0; i <= top; ++i) {\n-\t\t\tif (nW >= MAX_WINDOW)\n-\t\t\t\tfatalf("too many windows");\n-\n-\t\t\t// get indices of the first and last SNP in the interval\n-\t\t\tb = R[i].Lpos + 1;\n-\t\t\te = R[i].Rpos;\n-\n-\t\t\t// remove unmapped SNP position from intervals\' ends\n-\t\t\twhile (b < e && S[b].pos == -1)\n-\t\t\t\t++b;\n-\t\t\twhile (e > b && S[e].pos == -1)\n-\t\t\t\t--e;\n-\n-\t\t\t// record intervals\n-\t\t\tif (e - b < 3 || R[i].Score < cutoff)\n-\t\t\t\tcontinue;\n-\t\t\tW[nW].score = R[i].Score;\n-\t\t\tW[nW].chr = copy_string(chr);\n-\t\t\tW[nW].b = S[b].pos;\n-\t\t\tW[nW].e = S[e].pos+1;\t// Ws are half-open\n-\t\t\tif (snps)\n-\t\t\t\tW[nW].snps = add_snps(b, e);\n-\t\t\t++nW;\n-\t\t}\n-\t}\n-\n-\t// sort by decreasing score\n-\tqsort((void *)W, (size_t)nW, sizeof(W[0]), (const void *)Wcompar);\n-\n-\tfor (i = 0; i < nW; ++i) {\n-\t\tprintf("%s\\t%d\\t%d\\t%4.4f\\n", \n-\t\t\tW[i].chr, W[i].b, W[i].e, W[i].score);\n-\t\tfor (s = W[i].snps; s; s = s->next)\n-\t\t\tprintf(" %d %3.2f\\n", s->pos, s->x);\n-\t}\n-\treturn 0;\n-}\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 inbreeding_and_kinship.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/inbreeding_and_kinship.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,29 @@ +#!/usr/bin/env python + +import sys +import gd_util + +################################################################################ + +if len(sys.argv) != 6: + gd_util.die('Usage') + +ped_input, ind_input, computed_value, output, kinship_input = sys.argv[1:] + +################################################################################ + +prog = 'inbreed' + +args = [ prog ] +args.append(ped_input) # pedigree +args.append(ind_input) # specified individuals (e.g.,,potential breeding population) +args.append(kinship_input) # kinships of founders +args.append(computed_value) # 0 = inbreedng coefficients, 1 = kinships, 2 = mean kinships + +with open(output, 'w') as fh: + gd_util.run_program(prog, args, stdout=fh) + +################################################################################ + +sys.exit(0) + |
b |
diff -r 4188853b940b -r a631c2f6d913 inbreeding_and_kinship.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/inbreeding_and_kinship.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,133 @@ +<tool id="gd_inbreeding_and_kinship" name="Inbreeding and kinship" version="1.0.0"> + <description>: Analyze the pedigree without genomic data</description> + + <command interpreter="python"> + inbreeding_and_kinship.py '$ped_input' '$ind_input' '$computed_value' '$output' + #if $kinship_dataset.choice == '0' + '/dev/null' + #else if $kinship_dataset.choice == '1' + '$kinship_input' + #end if + </command> + + <inputs> + <param name="ped_input" type="data" format="txt" label="Pedigree dataset" /> + <param name="ind_input" type="data" format="txt" label="Individuals dataset" /> + <conditional name="kinship_dataset"> + <param name="choice" type="select" format="integer" label="Kinship dataset"> + <option value="0" selected="true">no kinship dataset</option> + <option value="1">select kinship dataset</option> + </param> + <when value="0" /> + <when value="1"> + <param name="kinship_input" type="data" format="txt" label="Kinship dataset" /> + </when> + </conditional> + <param name="computed_value" type="select" format="integer" label="Computed value"> + <option value="0" selected="true">inbreeding coeffiecients</option> + <option value="1">kinships</option> + <option value="2">mean kinships</option> + </param> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in text_ format. +The output dataset is in text_ format. + +.. _text: ./static/formatHelp.html#text + +----- + +**What it does** + +The user specifies a pedigree. This is done with a Galaxy table with one +row per individual, containing (1) the individual's name, (2) the name of +one of the individual's parents, which must have occurred at the start +of a previous line, and (3) the name of the individual's other parent, +which occurred at the start of a previous line. For a pedigree founder, +each parent name is replaced by "-". + +The user also provides a file that specifies a set of names of individuals +(specifically the first word on each line (one line per individual); +any subsequent information on a line is ignored. + +The user can optionally provide a file giving kinship information for +each pair of distinct individuals from the founder set. + +Finally the user picks from among the options: + + 1. inbreeding coefficients for each specified individual + 2. the kinship for each pair of distinct specified individual + 3. the mean kinship for each specified individual, i.e., the average kinship value for that individual and every specified individual + +The command reports the requested values. + +----- + +**Example** + +- input:: + + A - - + B - - + C - - + D - - + E - - + F A B + G A B + Thelma A F + Louise F G + +Rows can have more than three columns (such as the individual's sex), +but only the first three columns affect this command. + +Suppose on the other hand that we select an alternative +"founder" set, {A, F, G}. (We require a founder sets to have a +member on any ancestral path from Thelma or Louise.) The above pedigree +file is then replaced by:: + + A - - + F - - + G - - + Thelma A F + Louise F G + +The user then also provides a file giving kinship information for each +pairs of distinct individuals from the founder set; for the current +example, the kinship file is as follows:: + + A F 0.25 + A G 0.25 + F G 0.25 + +since parent-child pairs and siblings both have kinship 0.25. The +advantage is that this capability can be used in cases where the kinships +of the founders are not initially known, but instead are computationally +predicted, e.g., with the Galaxy "Discover" tool. + </help> +</tool> + + + + + + + + + |
b |
diff -r 4188853b940b -r a631c2f6d913 make_phylip.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/make_phylip.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
b'@@ -0,0 +1,511 @@\n+#!/usr/bin/env python\n+# -*- coding: utf-8 -*-\n+#\n+# mkFastas.py\n+# \n+# Copyright 2013 Oscar Reina <oscar@niska.bx.psu.edu>\n+# \n+# This program is free software; you can redistribute it and/or modify\n+# it under the terms of the GNU General Public License as published by\n+# the Free Software Foundation; either version 2 of the License, or\n+# (at your option) any later version.\n+# \n+# This program is distributed in the hope that it will be useful,\n+# but WITHOUT ANY WARRANTY; without even the implied warranty of\n+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n+# GNU General Public License for more details.\n+# \n+# You should have received a copy of the GNU General Public License\n+# along with this program; if not, write to the Free Software\n+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,\n+# MA 02110-1301, USA.\n+\n+import argparse\n+import errno\n+import os\n+import shutil\n+\n+def mkdir_p(path):\n+ try:\n+ os.makedirs(path)\n+ except OSError, e:\n+ if e.errno <> errno.EEXIST:\n+ raise\n+\n+def revseq(seq):\n+ seq=list(seq)\n+ seq.reverse()\n+ seq=\'\'.join(seq)\n+ return seq\n+ \n+def revComp(allPop):\n+ dAllCompAll={\'A\':\'T\',\'T\':\'A\',\'C\':\'G\',\'G\':\'C\',\'N\':\'N\',\'M\':\'K\',\'K\':\'M\',\'R\':\'Y\',\'Y\':\'R\',\'W\':\'W\',\'S\':\'S\'}\n+ allPopsComp=dAllCompAll[allPop]\n+ return allPopsComp\n+\n+def rtrnCons(ntA,ntB):\n+ srtdPairs=\'\'.join(sorted([ntA,ntB]))\n+ dpairsCons={\'AC\':\'M\', \'AG\':\'R\', \'AT\':\'W\', \'CG\':\'S\', \'CT\':\'Y\', \'GT\':\'K\', \'AN\':\'A\', \'CN\':\'C\', \'GN\':\'G\', \'NT\':\'T\'}\n+ cons=dpairsCons[srtdPairs]\n+ return cons\n+\n+def rtrnFxdChrPos(inSNPf,dPopsinSNPfPos,pxchrx,pxpos,pxntA,pxntB,fulldChrdPosdPopsAlllsInit=False,cvrgTreshold=False,indvlsPrctTrshld=False):\n+ """\n+ """\n+ dChrdPosdPopsAlllsInit={}\n+ seqref=[]\n+ for eachl in open(inSNPf,\'r\'):\n+ if eachl.strip() and eachl[0]!=\'#\':\n+ fllInfoSplt=eachl.splitlines()[0].split(\'\\t\')\n+ chrx=fllInfoSplt[pxchrx]\n+ pos=int(fllInfoSplt[pxpos])\n+ ntA=fllInfoSplt[pxntA]\n+ ntB=fllInfoSplt[pxntB]\n+ seqref.append([pos,ntA])\n+ dPopsAllls={}\n+ if fulldChrdPosdPopsAlllsInit:\n+ #~ \n+ cntIndv=0\n+ #\n+ try:\n+ fulldPopsAllls=fulldChrdPosdPopsAlllsInit[chrx][pos]\n+ except:\n+ fulldPopsAllls=dict([(echPop,ntA) for echPop in dPopsinSNPfPos]) \n+ #\n+ for eachPop in dPopsinSNPfPos:\n+ clmnCvrg=dPopsinSNPfPos[eachPop]\n+ if clmnCvrg:\n+ eachPopCvrg=int(fllInfoSplt[clmnCvrg])\n+ else:\n+ #~ eachPopCvrg=0\n+ eachPopCvrg=cvrgTreshold\n+ if eachPopCvrg>=cvrgTreshold:\n+ dPopsAllls[eachPop]=fulldPopsAllls[eachPop]\n+ cntIndv+=1\n+ else:\n+ dPopsAllls[eachPop]=\'N\' \n+ #~ \n+ if indvlsPrctTrshld>(cntIndv/float(len(dPopsinSNPfPos))):\n+ dPopsAllls=dict([(echPop,\'N\') for echPop in dPopsinSNPfPos])\n+ #~ \n+ else:\n+ for eachPop in dPopsinSNPfPos:\n+ if dPopsinSNPfPos[eachPop]:\n+ eachPopAll=int(fllInfoSplt[dPopsinSNPfPos[eachPop]])\n+ if eachPopAll==0:\n+ dPopsAllls[eachPop]=ntB\n+ elif eachPopAll==2:\n+ dPopsAllls[eachPop]=ntA\n+ elif eachPopAll==1:\n+ dPopsAllls[eachPop]=rtrnCons(ntA,ntB)\n+ else:\n+ dPopsAllls[eachPop]=\'N\'\n+ else:\n+ '..b" outFastaFold = './out'\n+ files_dir = args.output_dir\n+ gd_indivs = args.gd_indivs \n+ pxchrx = args.chrClmn\n+ pxpos = args.posClmn\n+ pxntA = args.refClmn\n+ pxntB = args.altrClmn\n+ \n+ \n+ inCDSfile = args.sequence\n+ inUCSCfile = args.gene_info\n+ fchrClmn = args.fchrClmn#chromosome column\n+ txStartClmn = args.txStartClmn#transcript start column\n+ txEndClmn = args.txEndClmn#transcript end column\n+ strandClmn = args.strandClmn#strand column\n+ geneNameClmn = args.geneNameClmn#gene name column\n+ cdsStartClmn = args.cdsStartClmn#coding sequence start column\n+ cdsEndClmn = args.cdsEndClmn#coding sequence end column\n+ startExsClmn = args.startExsClmn#exons start column\n+ endExsClmn = args.endExsClmn#exons end column\n+ \n+ inputCover = args.inputCover\n+ gd_indivs_cover = args.gd_indivs_cover\n+ cvrgTreshold = args.cvrgTreshold\n+ pxchrxCov = args.chrClmnCvrg\n+ pxposCov = args.posClmnCvrg\n+ pxntACov = args.refClmnCvrg\n+ pxntBCov = args.altrClmnCvrg\n+ indvlsPrctTrshld = args.indvlsPrctTrshld\n+ \n+ #print inputCover, gd_indivs_cover, cvrgTreshold\n+ \n+ assert ((inputCover and gd_indivs_cover and cvrgTreshold>=0 and indvlsPrctTrshld>=0) or (inCDSfile and inUCSCfile))\n+ \n+ #~ \n+ dPopsinSNPfPos=dict([(x.split()[1],int(x.split()[0])-1) for x in open(gd_indivs).read().splitlines() if x.strip()])\n+ #~ dPopsinSNPfPos.update({'ref':False})\n+ #~ \n+ sPopsIntrst=set(dPopsinSNPfPos.keys())\n+ dChrdPosdPopsAlllsInit,seqref,chrx,startExs,endExs=rtrnFxdChrPos(inSNPf,dPopsinSNPfPos,pxchrx,pxpos,pxntA,pxntB)#~ print '1. Getting fixed alleles information...'\n+ #~ dENSEMBLTseqChrStEnEx,dChrdStrtEndENSEMBLT=rtrndENSEMBLTseq(inCDSfile,inUCSCfile)\n+ #~\n+ if inputCover and gd_indivs_cover and cvrgTreshold>=0:\n+ dPopsinSNPfPos_cover=dict([(eachPop,False) for eachPop in dPopsinSNPfPos.keys()])\n+ dPopsinSNPfPos_cover.update(dict([(x.split()[1],int(x.split()[0])-1) for x in open(gd_indivs_cover).read().splitlines() if x.strip()]))\n+ dChrdPosdPopsAlllsInit,seqref,chrx,startExs,endExs=rtrnFxdChrPos(inputCover,dPopsinSNPfPos_cover,pxchrxCov,pxposCov,pxntACov,pxntBCov,dChrdPosdPopsAlllsInit,cvrgTreshold,indvlsPrctTrshld)\n+ rvrse=False\n+ dENSEMBLTseqChrStEnEx={'tmp':(seqref,chrx,startExs,endExs,rvrse)}\n+ dChrdStrtEndENSEMBLT={chrx:{(startExs[0],endExs[0]):'tmp'}}\n+ #~ \n+ elif inCDSfile and inUCSCfile:\n+ dENSEMBLTseqChrStEnEx,dChrdStrtEndENSEMBLT=rtrndENSEMBLTseq(inCDSfile,inUCSCfile,fchrClmn,txStartClmn,txEndClmn,strandClmn,geneNameClmn,startExsClmn,endExsClmn,cdsStartClmn,cdsEndClmn)#~ print '2. Getting transcripts and exons information...' \n+ #~ \n+ dENSEMBLTChrPosdAlls,dChrPosdPopsAllls=rtrnFxdChrPosinCodReg(dChrdStrtEndENSEMBLT,dChrdPosdPopsAlllsInit)#~ print '3. Getting fixed alleles in exons...'\n+ #~ \n+ dENSEMBLTPopsFasta,lchrStartexEndposAll=rtrnSeqVars(dENSEMBLTseqChrStEnEx,dENSEMBLTChrPosdAlls)#~ print '4. Getting fasta sequences of populations...'\n+ #~ \n+ wrapSeqsFasta(dENSEMBLTPopsFasta,outFastaFold,sPopsIntrst)\n+ #~ \n+\n+\n+ ## get a lit of output files\n+ files = []\n+ for dirpath, dirnames, filenames in os.walk(outFastaFold):\n+ for file in filenames:\n+ if file.endswith('.phy'):\n+ files.append( os.path.join(dirpath, file) )\n+ del dirnames[:]\n+\n+ if len(files) == 0:\n+ with open(outfile, 'w') as ofh:\n+ print >> ofh, 'No output.'\n+ else:\n+ ## the first file becomes the output\n+ file = files.pop(0)\n+ shutil.move(file, outfile)\n+\n+ ## rename/move the rest of the files\n+ for i, file in enumerate(files):\n+ new_filename = 'primary_{0}_output{1}_visible_txt_?'.format(outfile_id, i+2)\n+ new_pathname = os.path.join(files_dir, new_filename)\n+ shutil.move(file, new_pathname)\n+\n+ return 0\n+\n+if __name__ == '__main__':\n+ main()\n" |
b |
diff -r 4188853b940b -r a631c2f6d913 make_phylip.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/make_phylip.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
b'@@ -0,0 +1,178 @@\n+<tool id="gd_make_phylip" name="Phylip" version="1.0.0" force_history_refresh="True">\n+ <description>: prepare data for phylogenetic analysis</description>\n+\n+ <command interpreter="python">\n+ #set $zero_based = 1\n+ #set $gen_chrClmn = int($input.metadata.scaffold) - $zero_based\n+ #set $gen_posClmn = int($input.metadata.pos) - $zero_based\n+ #set $gen_refClmn = int($input.metadata.pos) - $zero_based + 1\n+ #set $gen_altrClmn = int($input.metadata.pos) - $zero_based + 2\n+ make_phylip.py \'--altrClmn=$gen_altrClmn\' \'--chrClmn=$gen_chrClmn\' \'--gd_indivs=$indivs_input\' \'--input=$input\' \'--output=$output1\' \'--output_id=$output1.id\' \'--output_dir=$__new_file_path__\' \'--posClmn=$gen_posClmn\' \'--refClmn=$gen_refClmn\'\n+ #if $input_type.choice == \'0\'\n+ #set $cov_chrClmn = int($input_type.coverage_input.metadata.scaffold) - $zero_based\n+ #set $cov_posClmn = int($input_type.coverage_input.metadata.pos) - $zero_based\n+ #set $cov_refClmn = int($input_type.coverage_input.metadata.pos) - $zero_based + 1\n+ #set $cov_altrClmn = int($input_type.coverage_input.metadata.pos) - $zero_based + 2\n+ \'--altrClmnCvrg=$cov_altrClmn\' \'--chrClmnCvrg=$cov_chrClmn\' \'--cvrgTreshold=$input_type.coverage_threshold\' \'--gd_indivs_cover=$indivs_input\' \'--indvlsPrctTrshld=$input_type.indivs_threshold\' \'--inputCover=$input_type.coverage_input\' \'--posClmnCvrg=$cov_posClmn\' \'--refClmnCvrg=$cov_refClmn\'\n+ #else if $input_type.choice == \'1\'\n+ #set $fchrClmn = int($input_type.annotation_input.metadata.chromCol) - $zero_based\n+ #set $strandClmn = int($input_type.annotation_input.metadata.strandCol) - $zero_based\n+ #set $geneNameClmn = int($input_type.annotation_input.metadata.nameCol) - $zero_based\n+ #set $txStartClmn = int(str($input_type.tx_start_col)) - $zero_based\n+ #set $txEndClmn = int(str($input_type.tx_end_col)) - $zero_based\n+ #set $cdsStartClmn = int(str($input_type.cds_start_col)) - $zero_based\n+ #set $cdsEndClmn = int(str($input_type.cds_end_col)) - $zero_based\n+ #set $startExsClmn = int(str($input_type.exs_start_col)) - $zero_based\n+ #set $endExsClmn = int(str($input_type.exs_end_col)) - $zero_based\n+ \'--cdsEndClmn=$cdsEndClmn\' \'--cdsStartClmn=$cdsStartClmn\' \'--endExsClmn=$endExsClmn\' \'--fchrClmn=$fchrClmn\' \'--geneNameClmn=$geneNameClmn\' \'--gene_info=$input_type.annotation_input\' \'--sequence=$input_type.fasta_input\' \'--startExsClmn=$startExsClmn\' \'--strandClmn=$strandClmn\' \'--txEndClmn=$txEndClmn\' \'--txStartClmn=$txStartClmn\'\n+ #end if\n+ </command>\n+\n+ <inputs>\n+ <param name="input" type="data" format="gd_genotype,gd_snp" label="Genotype/SNP dataset">\n+ <validator type="metadata" check="scaffold" message="scaffold missing" />\n+ <validator type="metadata" check="pos" message="pos missing" />\n+ </param>\n+ <param name="indivs_input" type="data" format="gd_indivs" label="Individuals dataset" />\n+ <conditional name="input_type">\n+ <param name="choice" type="select" format="integer" label="Input type">\n+ <option value="0" selected="true">Coverage</option>\n+ <option value="1">Genes</option>\n+ </param>\n+ <when value="0">\n+ <param name="coverage_input" type="data" format="gd_genotype,gd_snp" label="Coverage dataset">\n+ <validator type="metadata" check="scaffold" message="scaffold missing" />\n+ <validator type="metadata" check="pos" message="pos missing" />\n+ </param>\n+ <param name="coverage_threshold" type="integer" min="1" value="1" label="Coverage threshold" />\n+ <param name="indivs_threshold" type="float" value="0.5" min="0.0" max="1.0" label="Individuals genotype percentage threshold" />\n+ </when>\n+ <when value="1">\n+ <param name="annotation_input" type="data" format="interval" label="Genes dataset">\n+ <validator type="metadata" check="chromCol" message="chromCol missing" />\n+ <validator type="metadata" check="stra'..b'"Genes exon starts column" />\n+ <param name="exs_end_col" type="data_column" data_ref="input" label="Genes exon ends column" />\n+ <param name="fasta_input" type="data" format="fasta" label="FASTA dataset" />\n+ </when>\n+ </conditional>\n+ </inputs>\n+\n+ <outputs>\n+ <data name="output1" format="txt" />\n+ </outputs>\n+\n+ <help>\n+**What it does**\n+\n+This tool creates phylip formatted files from two different input types:\n+coverage and genes.\n+\n+If the coverage option is selected the inputs for the program are:\n+\n+ 1. a gd_indivs table\n+ 2. a gd_genotype file with the coverage information for individuals in the gd_indivs table\n+ 3. a gd_genotype file with the genotype information for individuals in the gd_indivs table\n+ 4. a coverage threshold (optional)\n+ 5. a percentage of individuals (threshold).\n+\n+The program produces a phylip formatted file using the sequence in the\n+genotype file as a template. In this sequence nucleotides for each\n+sequence that are below the coverage threshold, or the positions with\n+a percentage of individuals below the selected value are replaced by "N".\n+\n+If the gene option is selected the inputs for the program are:\n+\n+ 1. a gd_indivs table\n+ 2. a gene dataset table with a gene name in the first column\n+ 3. the column with transcript start in the gene dataset table\n+ 4. the column with transcript end in the gene dataset table\n+ 5. the column with coding start in the gene dataset table\n+ 6. the column with coding end in the gene dataset table\n+ 7. the column with exon starts (comma-separated) in the gene dataset table\n+ 8. the column with exon ends (comma-separated) in the gene dataset table\n+ 9. a FASTA formatted file for all the genes of interest with their names as headers (NOTE: these names should be the same in the input gene dataset table).\n+ \n+The program produces as output one phylip formatted file for each gene\n+in the gene dataset table.\n+\n+-----\n+\n+**Example**\n+\n+In a case were the option coverage is selected, for the inputs:\n+\n+- gd_indivs::\n+\n+ 7 W_Java\n+ 10 E_Java\n+ 16 Pen_Ma\n+ ...\n+\n+- Genotype table::\n+\n+ chrM 15 T C -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 0 -1 -1\n+ chrM 18 G A -1 -1 0 -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1\n+ chrM 20 C T -1 -1 0 -1 -1 2 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1\n+ ...\n+\n+- Coverage table::\n+\n+ chrM 0 G G 0 0 0 0 0 0 0 0 0 0 0 0 0\n+ chrM 1 T T 0 0 3 0 0 50 0 0 0 0 0 2 0\n+ chrM 2 T T 0 0 5 0 0 50 0 0 0 0 0 2 0\n+ ...\n+\n+- Coverage threshold = 0\n+\n+- Percentage of individuals = 0.0\n+\n+- The output is::\n+\n+ 4 19 15428\n+ W_Java GTTCATCATGTTCATCGAAT\n+ E_Java GTTCATCATGTTCATCGAAC\n+ Pen_Ma GTTCATCATGTTCATCGAAT\n+\n+In a case were option genotype is selected with the inputs:\n+\n+- Gene dataset table input::\n+\n+ 1 ENSLAFT00000017123 chrM + 1002 1061 1002 1061 1 1002, 1061, 0 ENSLAFG00000017122 cmpl incmpl 0, BTRC ENSLAFT00000017123 ENSLAFP00000014355\n+ 1 ENSLAFT00000037164 chrM - 1058 1092 1062 1073 1 1062,1068 1065,1073 0 ENSLAFG00000007680 cmpl cmpl 0, MYOF ENSLAFT00000037164 ENSLAFP00000025175 26509\n+ 1 ENSLAFT00000008925 chrM + 990 1000 990 1000 1 990, 1000, 0 ENSLAFG00000008924 incmpl incmpl 0, PRKG1 ENSLAFT00000008925 ENSLAFP00000007492\n+ ...\n+\n+In this table:\n+\n+ column with transcript start = 5\n+ column with transcript end = 6\n+ column with coding start = 7\n+ column with coding end = 8\n+ column with exon starts = 10\n+ column with exon ends = 11\n+\n+- gd_indivs::\n+\n+ 7 W_Java\n+ 10 E_Java\n+ 16 Pen_Ma\n+ ...\n+\n+- Genotype table::\n+\n+ chrM 1005 T C -1 -1 2 -1 -1 2 -1 -1 -1 -1 -1 2 -1 -1 -1 -1 0 -1 -1\n+ chrM 1060 G A -1 -1 0 -1 -1 0 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1\n+ chrM 991 C T -1 -1 0 -1 -1 2 -1 -1 -1 -1 -1 0 -1 -1 -1 -1 0 -1 -1\n+ ...\n+\n+The outputs are going to one file for each sequence in the input gene\n+dataset table (as long as they are included in the input FASTA file).\n+ </help>\n+</tool>\n' |
b |
diff -r 4188853b940b -r a631c2f6d913 nucleotide_diversity_pi.xml --- a/nucleotide_diversity_pi.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/nucleotide_diversity_pi.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -25,6 +25,10 @@ <data name="output" format="txt" /> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <help> **What it does** |
b |
diff -r 4188853b940b -r a631c2f6d913 offspring_heterozygosity.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,59 @@ +#!/usr/bin/env python + +import sys +import gd_util + +from Population import Population + +################################################################################ + +if len(sys.argv) != 7: + gd_util.die('Usage') + +input, input_type, ind_arg, p1_input, p2_input, output = sys.argv[1:] + +p_total = Population() +p_total.from_wrapped_dict(ind_arg) + +p1 = Population() +p1.from_population_file(p1_input) +if not p_total.is_superset(p1): + gd_util.die('There is an individual in the first population that is not in the SNP table') + +p2 = Population() +p2.from_population_file(p2_input) +if not p_total.is_superset(p2): + gd_util.die('There is an individual in the second population that is not in the SNP table') + +################################################################################ + +prog = 'offspring_heterozygosity' + +args = [ prog ] +args.append(input) # a Galaxy SNP table + +for tag in p1.tag_list(): + column, name = tag.split(':') + + if input_type == 'gd_genotype': + column = int(column) - 2 + + tag = '{0}:{1}:{2}'.format(column, 0, name) + args.append(tag) + +for tag in p2.tag_list(): + column, name = tag.split(':') + + if input_type == 'gd_genotype': + column = int(column) - 2 + + tag = '{0}:{1}:{2}'.format(column, 1, name) + args.append(tag) + +with open(output, 'w') as fh: + gd_util.run_program(prog, args, stdout=fh) + +################################################################################ + +sys.exit(0) + |
b |
diff -r 4188853b940b -r a631c2f6d913 offspring_heterozygosity.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,68 @@ +<tool id="gd_offspring_heterozygosity" name="Pairs sequenced" version="1.0.0"> + <description>: Offspring estimated heterozygosity of sequenced pairs</description> + + <command interpreter="python"> + #import json + #import base64 + #import zlib + #set $ind_names = $input.dataset.metadata.individual_names + #set $ind_colms = $input.dataset.metadata.individual_columns + #set $ind_dict = dict(zip($ind_names, $ind_colms)) + #set $ind_json = json.dumps($ind_dict, separators=(',',':')) + #set $ind_comp = zlib.compress($ind_json, 9) + #set $ind_arg = base64.b64encode($ind_comp) + offspring_heterozygosity.py '$input' '$input.ext' '$ind_arg' '$p1_input' '$p2_input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP dataset" /> + <param name="p1_input" type="data" format="gd_indivs" label="First individuals dataset" /> + <param name="p2_input" type="data" format="gd_indivs" label="Second individuals dataset" /> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in gd_snp_, gd_genotype_, and gd_indivs_ formats. +The output dataset is in text_ format. + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _text: ./static/formatHelp.html#text + +----- + +**What it does** + +For each pair of individuals, one from each specified set, the program +computes the expected heterozygosity of any offspring of the pair, i.e., +the probability that the offspring has distinct nucleotides at a randomly +chosen autosomal SNP. In other words, we add the following numbers for +each autosomal SNP where both genotypes are defined, then divide by the +number of those SNPs: + +0 if the individuals are homozygous for the same nucleotide + +1 if the individuals are homozygous for different nucleotides + +1/2 otherwise (i.e., if one or both individuals are heterozygous) + +A SNP is ignored if one or both individuals have an undefined genotype +(designated as -1). + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 offspring_heterozygosity_pedigree.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity_pedigree.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,58 @@ +#!/usr/bin/env python + +import sys +import gd_util + +from Population import Population + +def load_and_check_pop(file, total_pop, name): + p = Population() + p.from_population_file(file) + if not total_pop.is_superset(p): + gd_util.die('There is an individual in the {0} that is not in the SNP table'.format(name)) + return p + +def append_breeders_from_file(the_list, filename, kind): + with open(filename) as fh: + for line in fh: + elems = line.split() + breeder = elems[0].rstrip('\r\n') + the_list.append('{0}:{1}'.format(kind, breeder)) + +################################################################################ + +if len(sys.argv) != 9: + gd_util.die('Usage') + +input, input_type, pedigree, ind_arg, founders, b1_input, b2_input, output = sys.argv[1:] + +p_total = Population() +p_total.from_wrapped_dict(ind_arg) + +f1 = load_and_check_pop(founders, p_total, 'founders') + +################################################################################ + +prog = 'offspring_heterozygosity2' + +args = [ prog ] +args.append(input) # a Galaxy SNP table +args.append(pedigree) # a pedigree, where the SNP table is for the founders + +for tag in f1.tag_list(): + column, name = tag.split(':') + if type == 'gd_genotype': + column = int(column) - 2 + tag = 'founder:{0}:{1}'.format(column, name) + args.append(tag) + +append_breeders_from_file(args, b1_input, 0) +append_breeders_from_file(args, b2_input, 1) + +with open(output, 'w') as fh: + gd_util.run_program(prog, args, stdout=fh) + +################################################################################ + +sys.exit(0) + |
b |
diff -r 4188853b940b -r a631c2f6d913 offspring_heterozygosity_pedigree.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/offspring_heterozygosity_pedigree.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,81 @@ +<tool id="gd_offspring_heterozygosity_pedigree" name="Founders sequenced" version="1.0.0"> + <description>: Offspring estimated heterozygosity from a pedigree with sequenced founders</description> + + <command interpreter="python"> + #import json + #import base64 + #import zlib + #set $ind_names = $input.dataset.metadata.individual_names + #set $ind_colms = $input.dataset.metadata.individual_columns + #set $ind_dict = dict(zip($ind_names, $ind_colms)) + #set $ind_json = json.dumps($ind_dict, separators=(',',':')) + #set $ind_comp = zlib.compress($ind_json, 9) + #set $ind_arg = base64.b64encode($ind_comp) + offspring_heterozygosity_pedigree.py '$input' '$input.ext' '$pedigree' '$ind_arg' '$founders' '$b1_input' '$b2_input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="gd_snp,gd_genotype" label="SNP dataset" /> + <param name="pedigree" type="data" format="txt" label="Pedigree dataset" /> + <param name="founders" type="data" format="gd_indivs" label="Founders dataset" /> + <param name="b1_input" type="data" format="txt" label="First breeders dataset" /> + <param name="b2_input" type="data" format="txt" label="Second breeders dataset" /> + </inputs> + + <outputs> + <data name="output" format="txt" /> + </outputs> + + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <!-- + <tests> + </tests> + --> + + <help> + +**Dataset formats** + +The input datasets are in gd_snp_, gd_genotype_, text_, and gd_indivs_ formats. +The output dataset is in text_ format. + +.. _gd_snp: ./static/formatHelp.html#gd_snp +.. _gd_genotype: ./static/formatHelp.html#gd_genotype +.. _gd_indivs: ./static/formatHelp.html#gd_indivs +.. _text: ./static/formatHelp.html#text + +----- + +**What it does** + +The user provides a Galaxy SNP table (gd_snp or gd_genotype format) that +includes the founders of a pedigree, as well as two sets of individuals. +The pedigree is specified by a text file with one row per individual, +containing (1) the individual's name, (2) the name of one of the +individual's parents, which must have occurred at the start of a previous +line, and (3) the name of the individual's other parent, which occurred at +the start of a previous line. For a pedigree founder, both parent names +are replaced by "-". The founders are specified by a table in +gd_indivs format, e.g., as produced by "Specify individuals" +tool. Every founder must have genotypes supplied in the SNP table, +and both parents need to be given as "-" in the pedigree. +Conversely, every pedigree individual whose parents are "-" +must be named as a founder. + +The user also provides two files that specify a set of names of +individuals. The first word on each line names an individual (one +line per individual); any subsequent information on a line is ignored. +The name of each individual must appear at the start of a line in the +pedigree. + +For each pair of individuals, one from each specified set, the program +computes the expected heterozygosity of any offspring of the pair, +i.e., the probability that the offspring has distinct nucleotides at +a randomly chosen autosomal SNP. A SNP is ignorned if one or both +potential parents have an ancestor with undefined genotype (designated +as -1 in the SNP table). + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 pathway_image.xml --- a/pathway_image.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/pathway_image.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -27,6 +27,10 @@ <data name="output" format="png" /> </outputs> + <requirements> + <requirement type="package" version="0.2.5">mechanize</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_sap" ftype="gd_sap" /> @@ -77,7 +81,7 @@ output showing pathway cfa05214: -.. image:: ${static_path}/images/gd_pathway_image.png +.. image:: $PATH_TO_IMAGES/gd_pathway_image.png </help> </tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 pca.xml --- a/pca.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/pca.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -13,6 +13,12 @@ <data name="output" format="html" /> </outputs> + <requirements> + <requirement type="package" version="5.0.1">eigensoft</requirement> + <requirement type="package" version="0.1">gd_c_tools</requirement> + <requirement type="package" version="3.2.1">beautifulsoup</requirement> + </requirements> + <!-- <tests> <test> |
b |
diff -r 4188853b940b -r a631c2f6d913 phylogenetic_tree.xml --- a/phylogenetic_tree.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/phylogenetic_tree.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -136,6 +136,13 @@ </test> </tests> + <requirements> + <requirement type="package" version="1.3">phast</requirement> + <requirement type="package" version="1.1">quicktree</requirement> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + + <help> **Dataset formats** |
b |
diff -r 4188853b940b -r a631c2f6d913 population_structure.xml --- a/population_structure.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/population_structure.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -14,6 +14,10 @@ <data name="output" format="html" /> </outputs> + <requirements> + <requirement type="package" version="3.2.1">beautifulsoup</requirement> + </requirements> + <!-- <tests> <test> |
b |
diff -r 4188853b940b -r a631c2f6d913 prepare_population_structure.xml --- a/prepare_population_structure.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/prepare_population_structure.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -71,6 +71,10 @@ </data> </outputs> + <requirements> + <requirement type="package" version="0.1">gd_c_tools</requirement> + </requirements> + <tests> <test> <param name="input" value="test_in/sample.gd_snp" ftype="gd_snp" /> |
b |
diff -r 4188853b940b -r a631c2f6d913 rank_pathways.xml --- a/rank_pathways.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/rank_pathways.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -53,6 +53,13 @@ <data name="output" format="tabular" /> </outputs> + <requirements> + <requirement type="package" version="0.2.5">mechanize</requirement> + <requirement type="package" version="1.8.1">networkx</requirement> + <requirement type="package" version="0.1.4">fisher</requirement> + </requirements> + + <tests> <test> </test> @@ -64,7 +71,7 @@ The query dataset has a column containing ENSEMBL transcript codes for the gene set of interest, while the background dataset has one column -with ENSEMBL transcript codes and another with GO terms, for some larger +with ENSEMBL transcript codes and another with KEGG pathways, for some larger universe of genes. All of the input and output datasets are in tabular_ format. The input |
b |
diff -r 4188853b940b -r a631c2f6d913 rank_terms.xml --- a/rank_terms.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/rank_terms.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -25,6 +25,10 @@ <data name="output" format="tabular" /> </outputs> + <requirements> + <requirement type="package" version="0.1.4">fisher</requirement> + </requirements> + <help> **Dataset formats** |
b |
diff -r 4188853b940b -r a631c2f6d913 raxml.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/raxml.py Fri Sep 20 13:25:27 2013 -0400 |
[ |
@@ -0,0 +1,51 @@ +#!/usr/bin/env python + +import random +import sys +import shutil +import gd_util + +################################################################################ + +if len(sys.argv) != 3: + gd_util.die('Usage') + +input, output = sys.argv[1:] +random.seed() + +################################################################################ + +prog = 'raxmlHPC' + +args = [ prog ] + +## required: -s sequenceFileName -n outputFileName -m substitutionModel +## we supply -s, -n (they are not allowed from user) + +args.append('-s') # name of the alignment data file in PHYLIP format +args.append(input) + +args.append('-n') # name of the output file +args.append('fake') + +## default options +args.append('-m') # substitutionModel +args.append('GTRGAMMA') # GTR + Optimization of substitution rates + GAMMA model of rate + # heterogeneity (alpha parameter will be estimated) + +args.append('-N') # number of alternative runs on distinct starting trees +args.append(1000) + +args.append('-f') # select algorithm +args.append('a') # rapid Bootstrap analysis and search for + # best-scoring ML tree in one program run + +args.append('-x') # integer random seed and turn on rapid bootstrapping +args.append(random.randint(0,100000000000000)) + +args.append('-p') # random seed for parsimony inferences +args.append(random.randint(0,100000000000000)) + +gd_util.run_program(prog, args) +shutil.copy2('RAxML_bipartitions.fake', output) +sys.exit(0) |
b |
diff -r 4188853b940b -r a631c2f6d913 raxml.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/raxml.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,37 @@ +<tool id="gd_raxml" name="RAxML" version="1.0.0"> + <description>: construct a maximum-likelihood phylogenetic tree</description> + + <command interpreter="python"> + raxml.py '$input' '$output' + </command> + + <inputs> + <param name="input" type="data" format="txt" label="PHYLIP dataset" /> + </inputs> + + <outputs> + <data name="output" format="nhx" /> + </outputs> + + + <requirements> + <requirement type="package" version="7.7.6">raxml</requirement> + </requirements> + + <help> +**What it does** + +This tool runs RAxML on a phylip formatted file and returns a maximum +likelihood phylogram supported by a desired number of bootstraps. + +This program takes as input a phylip formatted file and optionally a +number of parameters (for further information consult the manual_), +and returns a Newick formatted tree that can be explored with Phyloviz. + +By default the program runs 1,000 fast bootstraps on the best likelihood +tree constructed with the GRT + gamma model. + +.. _manual: http://sco.h-its.org/exelixis/oldPage/RAxML-Manual.7.0.4.pdf + + </help> +</tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 reorder.xml --- a/reorder.xml Fri Jul 26 12:51:13 2013 -0400 +++ b/reorder.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -1,5 +1,5 @@ -<tool id="gd_reorder" name="Reorder" version="1.0.0"> - <description>individuals</description> +<tool id="gd_reorder" name="Reorder individuals" version="1.0.0"> + <description>: exchange rows in the above picture</description> <command interpreter="python"> reorder.py '$input' '$output' '$order' @@ -15,5 +15,64 @@ </outputs> <help> +**Dataset formats** + +The input and output datasets are in gd_indivs_ format. + +.. _gd_indivs: ./static/formatHelp.html#gd_indivs + +----- + +**What it does** + +The user picks a gd_indivs dataset from their history and specifies +a new ordering. This tool creates a new gd_indivs dataset with the +individuals reordered as specified by the user. + +The new ordering is a list of comma separated ranges (e.g **5,6-12,20**). +Ranges can be either a single number (e.g. **3**) or two dash separated +numbers (e.g. **3-5**). The numbers represent the line number of +gd_indivs dataset. Line numbers that are not listed will appear on the +output after the specified line numbers in their same relative ordering. + +----- + +**Example** + +Input dataset (six rows):: + + 18 McClintock + 22 Peltonen-Palotie + 26 Sager + 30 Franklin + 34 Auerbach + 38 Stevens + +new ordering "**1,3-4**" will return:: + + 18 McClintock + 26 Sager + 30 Franklin + 22 Peltonen-Palotie + 34 Auerbach + 38 Stevens + +new ordering "**3,5,1,6**" will return:: + + 26 Sager + 34 Auerbach + 18 McClintock + 38 Stevens + 22 Peltonen-Palotie + 30 Franklin + +new ordering "**3-1,6,4-5**" will return:: + + 26 Sager + 22 Peltonen-Palotie + 18 McClintock + 38 Stevens + 30 Franklin + 34 Auerbach </help> </tool> |
b |
diff -r 4188853b940b -r a631c2f6d913 static/images/cluster_kegg_formula.png |
b |
Binary file static/images/cluster_kegg_formula.png has changed |
b |
diff -r 4188853b940b -r a631c2f6d913 static/images/gd_coverage.png |
b |
Binary file static/images/gd_coverage.png has changed |
b |
diff -r 4188853b940b -r a631c2f6d913 tool_dependencies.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_dependencies.xml Fri Sep 20 13:25:27 2013 -0400 |
b |
@@ -0,0 +1,36 @@ +<?xml version="1.0"?> +<tool_dependency> + <package name="beautifulsoup" version="3.2.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_beautifulsoup_3_2_1" changeset_revision="83c21b81ee9d" /> + </package> + <package name="eigensoft" version="5.0.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_eigensoft_5_0_1" changeset_revision="02f04f3579b5" /> + </package> + <package name="fisher" version="0.1.4"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_fisher_0_1_4" changeset_revision="c84c287b81a4" /> + </package> + <package name="gd_c_tools" version="0.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_gd_c_tools_0_1" changeset_revision="7361ee4b5f40" /> + </package> + <package name="matplotlib" version="1.2.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="iuc" name="package_matplotlib_1_2" changeset_revision="9d164359606b" /> + </package> + <package name="mechanize" version="0.2.5"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_mechanize_0_2_5" changeset_revision="59801857421b" /> + </package> + <package name="munkres" version="1.0.5.4"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_munkres_1_0_5_4" changeset_revision="613b89b28767" /> + </package> + <package name="networkx" version="1.8.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_networkx_1_8_1" changeset_revision="43c20433f2d6" /> + </package> + <package name="phast" version="1.3"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_phast_1_3" changeset_revision="f633177177b9" /> + </package> + <package name="quicktree" version="1.1"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_quicktree_1_1" changeset_revision="dae77031fa2f" /> + </package> + <package name="raxml" version="7.7.6"> + <repository prior_installation_required="True" toolshed="http://toolshed.g2.bx.psu.edu/" owner="miller-lab" name="package_raxml_7_7_6" changeset_revision="77f73a8c45be" /> + </package> +</tool_dependency> |