ensembl_longest_cds_per_gene: ensembl_longest_cds_per

comparison ensembl_longest_cds_per_gene.py @ 2:6cf9f7f6509c draft default tip

planemo upload for repository https://github.com/TGAC/earlham-galaxytools/tree/master/tools/ensembl_longest_cds_per_gene commit 651fae48371f845578753052c6fe173e3bb35670

author	earlhaminst
date	Wed, 15 Mar 2017 20:23:13 -0400
parents	4dba69135845
children

comparison

-:a07680f3033a
+:6cf9f7f6509c
 """
 This script reads a CDS FASTA file from Ensembl and outputs a FASTA file with
-only the longest CDS sequence for each gene. The header of the sequences in the
+only the longest CDS sequence for each gene.
-output file will be the transcript id without version.
 """
 from __future__ import print_function
 import collections
 import optparse
 def remove_id_version(s):
     """
     Remove the optional '.VERSION' from an Ensembl id.
     """
-    return s.split('.')[0]
+    if s.startswith('ENS'):
+        return s.split('.')[0]
+    else:
+        return s
 parser = optparse.OptionParser()
 parser.add_option('-f', '--fasta', dest="input_fasta_filename",
                   help='CDS file in FASTA format from Ensembl')
 gene_transcripts_dict = dict()
 for entry in FASTAReader_gen(options.input_fasta_filename):
     transcript_id, rest = entry.header[1:].split(' ', 1)
-    transcript_id = remove_id_version(transcript_id)
     gene_id = None
     for s in rest.split(' '):
         if s.startswith('gene:'):
             gene_id = remove_id_version(s[5:])
             break
 # first one to appear in the FASTA file is selected
 selected_transcript_ids = [max(transcript_id_lengths, key=lambda _: _[1])[0] for transcript_id_lengths in gene_transcripts_dict.values()]
 with open(options.output_fasta_filename, 'w') as output_fasta_file:
     for entry in FASTAReader_gen(options.input_fasta_filename):
-        transcript_id = remove_id_version(entry.header[1:].split(' ')[0])
+        transcript_id = entry.header[1:].split(' ')[0]
         if transcript_id in selected_transcript_ids:
-            output_fasta_file.write(">%s\n%s\n" % (transcript_id, entry.sequence))
+            output_fasta_file.write("%s\n%s\n" % (entry.header, entry.sequence))
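
The net effect of this changeset can be illustrated with a small standalone sketch. It is not the script itself: `select_longest` and the `(header, sequence)` tuples are hypothetical stand-ins for the `FASTAReader_gen` entries, and skipping records that lack a `gene:` field is an assumption, since that part of the script is not shown in the comparison. It mirrors three things that are visible in the diff: versions are stripped only from ids starting with `ENS`, transcript ids are compared with their version intact, and the original header is written out unchanged.

```python
def remove_id_version(s):
    """Strip the optional '.VERSION' suffix, but only from ids starting with 'ENS'."""
    if s.startswith('ENS'):
        return s.split('.')[0]
    return s


def select_longest(records):
    """
    Hypothetical helper, not part of the original script.

    records: list of (header, sequence) tuples, each header starting with '>'.
    Returns the records holding the longest CDS of each gene, in input order,
    with headers left untouched.
    """
    gene_transcripts = {}
    for header, sequence in records:
        transcript_id, rest = header[1:].split(' ', 1)
        gene_id = None
        for field in rest.split(' '):
            if field.startswith('gene:'):
                gene_id = remove_id_version(field[5:])
                break
        if gene_id is None:
            continue  # assumption: records without a 'gene:' field are skipped
        gene_transcripts.setdefault(gene_id, []).append((transcript_id, len(sequence)))
    # max() returns the first maximal item, so on a tie the transcript that
    # appears first in the input is selected
    selected = {max(pairs, key=lambda pair: pair[1])[0]
                for pairs in gene_transcripts.values()}
    return [(header, sequence) for header, sequence in records
            if header[1:].split(' ')[0] in selected]
```

Because the transcript id is no longer stripped of its version, the sketch (like the revised script) assumes that versioned transcript ids are unique within the input file.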
