view commons/core/parsing/README_MultiFasta2SNPFile @ 9:1eb55963fe39

Updated CompareOverlappingSmall*.py
author m-zytnicki
date Thu, 14 Mar 2013 05:23:05 -0400

*** DESCRIPTION: ***
This program takes as input a multifasta file (sequences already aligned together, in FASTA format, in the same file), considers the first sequence as the reference sequence, infers polymorphisms and generates output files in GnpSNP exchange format.
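
As an illustration only, the comparison against the reference sequence can be sketched as below. This is a minimal sketch and not the parser's actual code: the function names read_aligned_fasta and find_polymorphisms are hypothetical, only column-by-column differences are reported, and no GnpSNP file is written.

```python
def read_aligned_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs, in file order."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def find_polymorphisms(records):
    """Compare every sequence to the first one (the reference), column by column.

    Returns (position, sequence name, reference character, observed character)
    tuples, with 1-based positions in the alignment.
    """
    (ref_name, ref_seq), others = records[0], records[1:]
    polymorphisms = []
    for name, seq in others:
        if len(seq) != len(ref_seq):
            raise ValueError("sequence %s is not aligned to the reference" % name)
        for position, (ref_base, base) in enumerate(zip(ref_seq, seq), start=1):
            if base.upper() != ref_base.upper():
                polymorphisms.append((position, name, ref_base, base))
    return polymorphisms


# Hypothetical three-sequence alignment; "Ref" plays the role of the reference.
example = """>Ref
ACGT-ACGT
>Ind1
ACGTTACGT
>Ind2
ACCT-ACGA
"""

for row in find_polymorphisms(read_aligned_fasta(example)):
    print(row)
```

In this sketch a gap character ("-") is compared like any other character, so an insertion relative to the reference shows up as a difference at a single column.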


*** INSTALLATION: ***
Dependencies:
- Python must be installed on your system.
- The Repet libraries are also required.

*** OPTIONS OF THE LAUNCHER: ***

    -h: this help

Mandatory options:    
         -b: Name of the batch of submitted sequences
         -g: Name of the gene
         -t: Scientific name of the taxon concerned

Mutually exclusive options (use one or the other, not both):
         -f: Name of the multifasta input file (for one input file)
         -d: Name of the directory containing multifasta input file(s) (for several input files)
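
For readers who want to wrap or reproduce this interface, the option set above can be sketched with Python's standard argparse module. This is an assumption-laden sketch: the real launcher may parse its options differently, and the attribute names (batch_name, gene_name, taxon, input_file, input_dir) are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (-h is added by argparse)."""
    parser = argparse.ArgumentParser(prog="multifastaParserLauncher.py")
    # Mandatory options.
    parser.add_argument("-b", dest="batch_name", required=True,
                        help="Name of the batch of submitted sequences")
    parser.add_argument("-g", dest="gene_name", required=True,
                        help="Name of the gene")
    parser.add_argument("-t", dest="taxon", required=True,
                        help="Scientific name of the taxon concerned")
    # Exactly one of -f / -d must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-f", dest="input_file",
                        help="Name of the multifasta input file")
    source.add_argument("-d", dest="input_dir",
                        help="Directory containing multifasta input file(s)")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is testable.
args = build_parser().parse_args(
    ["-b", "Batch_test", "-g", "GeneX", "-t", "Arabidopsis thaliana",
     "-f", "input.fasta"]
)
print(args.batch_name, args.gene_name, args.taxon, args.input_file, args.input_dir)
```

With this sketch, passing both -f and -d (or neither) makes argparse print a usage error and exit.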



*** COMMAND LINE EXAMPLE (for package use): ***
- First, you need to set up the environment variable PYTHONPATH (to link with the dependencies).

- Then for one input file (here our example), run:

python multifastaParserLauncher.py -b Batch_test -g GeneX -t "Arabidopsis thaliana" -f Exemple_multifasta_input.fasta


- For several input files, create a directory in the root of the uncompressed package and put your input files in it. Then use this type of command line:

python multifastaParserLauncher.py -b Batch_test -g GeneX -t "Arabidopsis thaliana" -d <Name_of_the_directory>

Each input file generates its own directory containing its set of output files.
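
The "one output directory per input file" behaviour can be pictured with the sketch below. It is only an illustration: the helper names and the rule used to name each output directory (the input file name without its extension) are assumptions, not the launcher's documented behaviour.

```python
import os
import tempfile


def plan_output_dirs(input_dir, output_root):
    """Map each regular file of input_dir to its own output directory."""
    plan = {}
    for file_name in sorted(os.listdir(input_dir)):
        input_path = os.path.join(input_dir, file_name)
        if not os.path.isfile(input_path):
            continue  # skip sub-directories
        base_name = os.path.splitext(file_name)[0]
        # Assumed naming rule: output directory = input file name without extension.
        plan[input_path] = os.path.join(output_root, base_name)
    return plan


def prepare_output_dirs(plan):
    """Create every planned output directory."""
    for output_dir in plan.values():
        os.makedirs(output_dir, exist_ok=True)


# Demonstration in temporary directories with two hypothetical input files.
input_dir = tempfile.mkdtemp()
output_root = tempfile.mkdtemp()
for file_name in ("geneA.fasta", "geneB.fasta"):
    with open(os.path.join(input_dir, file_name), "w") as handle:
        handle.write(">Ref\nACGT\n")

plan = plan_output_dirs(input_dir, output_root)
prepare_output_dirs(plan)
created = sorted(os.listdir(output_root))
print(created)
```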


*** SIMPLE USE (for package use): ***
Two executables (one for Windows, the other for Linux/Unix) are in the package.
They show the command lines to use in order to set up the environment variables and then to run the parser on our sample input file (Example_multifasta_input.fasta).
You can edit the executable and customize the command line to use it with your own input file.


*** BACKLOG (next version) ***
When the launcher is called for several input files (with the -d option), the parser should be able to generate a single set of files describing all the batches (one batch per input file).
The backlog tasks dedicated to this feature are listed below:

- in Multifasta2SNPFile class: 
  # CONSTRUCTOR: Modify the constructor to add a "several batches" mode called without BatchName and GeneName
  # RUNNING METHOD: Add the run_several_batches(directory) method that will browse the input files and iterate over them to run each of them successively (see runSeveralInputFile() method of the launcher)
  => 2 days
  
  # BATCH MANAGEMENT: Modify createBatchDict() to create one batch per file in the dictionary and add a class variable to point toward the current batch (ex: self._iCurrentLineNumber)
  # BATCH-LINE MANAGEMENT: Modify the _completeBatchLineListWithCurrentIndividual method to allow several batches and to link lines to batches (for the moment batch no. 1 is hard-coded)
  # SUBSNP MANAGEMENT: Check that every element (dSUbSNP) added to the SubSNP list (lSubSNPFileResults) is linked to the current batch (for the moment batch no. 1 is hard-coded)
    Impacted methods: manageSNPs(), createSubSNPFromAMissingPolym(), addMissingAllelesAndSubSNPsForOnePolym(), mergeAllelesAndSubSNPsFromOverlappingIndels()
  => + 2 days
  
- in Multifasta2SNPFileWriter class:
  # Modify all the _write<X>File methods (ex: _writeSubSNPFile) to write in append mode and externalize all file opening and closing
  # Create one method to open all the output files and call it in Multifasta2SNPFile run_several_batches method
  # Create one method to close all the output files and call it in Multifasta2SNPFile run_several_batches method
  
  => + 2 days
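
The writer-side tasks above (open all output files once, append to them for each batch, close them at the end) could look like the following sketch. Everything here is hypothetical and only illustrates the idea: the class name, the file names ("Batch.csv", "SubSNP.csv"), the line format and the signature of run_several_batches are assumptions, not the existing Multifasta2SNPFileWriter API.

```python
import os
import tempfile


class MultiBatchWriterSketch(object):
    """Hypothetical writer keeping its output files open across several batches."""

    FILE_NAMES = ("Batch", "SubSNP")  # assumed subset of the output files

    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._handles = {}

    def open_all(self):
        """Open every output file once, in append mode."""
        for name in self.FILE_NAMES:
            path = os.path.join(self._output_dir, name + ".csv")
            self._handles[name] = open(path, "a")

    def write_line(self, name, line):
        """Append one line to an already opened output file."""
        self._handles[name].write(line + "\n")

    def close_all(self):
        """Close every output file."""
        for handle in self._handles.values():
            handle.close()
        self._handles = {}


def run_several_batches(writer, batches):
    """Write all batches into one set of files; files are opened and closed once."""
    writer.open_all()
    try:
        for batch_name, sub_snp_lines in batches:
            writer.write_line("Batch", batch_name)
            for line in sub_snp_lines:
                # Each SubSNP line is linked to its batch, not to a hard-coded one.
                writer.write_line("SubSNP", batch_name + ";" + line)
    finally:
        writer.close_all()


# Demonstration with two hypothetical batches in a temporary directory.
output_dir = tempfile.mkdtemp()
run_several_batches(
    MultiBatchWriterSketch(output_dir),
    [("Batch1", ["A/T"]), ("Batch2", ["C/G", "-/T"])],
)
with open(os.path.join(output_dir, "SubSNP.csv")) as handle:
    sub_snp_content = handle.read()
print(sub_snp_content)
```

Keeping the open and close calls outside the per-file write methods is what lets several batches share one set of output files.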