# HG changeset patch # User brigidar # Date 1446840737 18000 # Node ID 3a352cb5711770718c2166ebcaf6b60343316932 # Parent d5725351bf221ab481088c4caac09af42c7551cc new script updates diff -r d5725351bf22 -r 3a352cb57117 vcf_snp.py --- a/vcf_snp.py Mon Nov 02 19:42:54 2015 -0500 +++ b/vcf_snp.py Fri Nov 06 15:12:17 2015 -0500 @@ -1,13 +1,14 @@ #!/usr/bin/env python +##!/usr/local/anaconda/bin/python ######################################################################################### # # # Name : vcf_snp.py # -# Version : 0.2 # +# Version : 0.3 # # Project : extract snp from vcf # # Description : Script to exctract snps # # Author : Brigida Rusconi # -# Date : November 02 2015 # +# Date : November 06 2015 # # # ######################################################################################### @@ -30,40 +31,54 @@ parser.add_argument('-o', '--output', help="snp tab") parser.add_argument('-s', '--snp_table', help="vcf") +parser.add_argument('-p', '--positions', help="positions") + args = parser.parse_args() output_file = args.output input_file = args.snp_table +input_file2=args.positions #------------------------------------------------------------------------------------------ #read in file as dataframe df =read_csv(input_file,sep='\t', dtype=object) +df3=read_csv(input_file2,sep='\t', dtype=object, names=['#CHROM','POS']) +#pdb.set_trace() +#get the positions that are missing +if df.index.size!= df3.index.size: + df4=df3[~df3.POS.isin(df.POS)] +#pdb.set_trace() + df4=df4.set_index('POS') + df=df.set_index('POS') + df2=concat([df,df4], axis=1) +#if there is no information for a loci the position is dropped so it fills the nan with N + df5=df2.iloc[:, 2:4].fillna('N') + df5=df5.reindex(df3.POS) +else: + df5=df.iloc[:,3:5] #------------------------------------------------------------------------------------------ -# only columns with qbase and refbase in table -#count_qbase=list(df.columns.values) -#qindexes=[] -#for i, v in enumerate(count_qbase): -# if 'ALT' in v: -# qindexes.append(i) -df2=df.iloc[:,3:5] + #pdb.set_trace() #------------------------------------------------------------------------------------------ +#replaces the . with the nucleotide call of the reference also deals with multiallelic states calling them N ref_list=[] -for i in range(0,df2.index.size): - if df2.iloc[i,1]==".": - ref_list.append(df2.iloc[i,0][0]) +for i in range(0,df5.index.size): + if df5.iloc[i,1]==".": + ref_list.append(df5.iloc[i,0][0]) + elif len(df5.iloc[i,1])>1: + ref_list.append('N') else: - ref_list.append(df2.iloc[i,1][0]) + ref_list.append(df5.iloc[i,1][0]) #pdb.set_trace() # ##------------------------------------------------------------------------------------------ # #save file with output name for fasta -o option and removes header and index with open(output_file,'w') as output: - output.write(df.columns.values[4] + '\t' + ''.join([str(i) for v,i in enumerate(ref_list)])) + output.write('ALT' + '\t' + ''.join([str(i) for v,i in enumerate(ref_list)])) ##------------------------------------------------------------------------------------------ \ No newline at end of file