Mercurial > repos > davidvanzessen > argalaxy_tools
view igblastparser/igparse.pl @ 9:efa1f5a17b6e draft
Uploaded
author | davidvanzessen |
---|---|
date | Mon, 19 Dec 2016 09:23:01 -0500 |
parents | afe85eb6572e |
children |
line wrap: on
line source
#!/usr/bin/perl =head1 IGBLAST_simple.pl This version (1.4) has been heavily adapted since the original program was first created back in October 2012. Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed with minor - though important - code changes. From V 1.2 onwards a 'Change Log' is included at the end of the program =head2 Usage Requires no modules in general use; the Data::Dumper (supplied as part of the Perl Core module set) might be useful for debugging/adjustment as it allows inspection of the data stores. The program takes a text file of the ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process-> Supply the text version of the igBLAST report in the format as in the example below. The extra command line arugment is the record number (aka. BLAST report) to process. If 0 or absent all are processed, if supplied that record (base 1) is processed and the program dies afterwards. =head2 Example Input A standard igBLAST record or set of them in a file; this being typical: BLASTN 2.2.27+ Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Database: human_gl_V; human_gl_D; human_gl_J 674 sequences; 179,480 total letters Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1 run=R_2012_04_10_11_57_56_ Length=433 Score E Sequences producing significant alignments: (Bits) Value lcl|IGHV3-30*04 330 2e-92 lcl|IGHV3-30-3*01 330 2e-92 lcl|IGHV3-30*01 327 2e-91 lcl|IGHD3-16*01 14.4 11 lcl|IGHD3-16*02 14.4 11 lcl|IGHD1-14*01 12.4 43 lcl|IGHJ4*02 78.3 1e-18 lcl|IGHJ5*02 70.3 4e-16 lcl|IGHJ4*01 68.3 2e-15 Domain classification requested: imgt V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand): IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame + V(D)J junction details (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself AGAGA TATGAGCCCCATCATGACA ACGTTTG CCGGAA ACTAC Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity) FWR1 27 38 12 11 1 0 91.7 CDR1 39 62 24 22 2 0 91.7 FWR2 63 113 51 50 1 0 98 CDR2 114 137 24 23 1 0 95.8 FWR3 138 251 114 109 5 0 95.6 CDR3 (V region only) 252 259 8 7 1 0 87.5 Total N/A N/A 233 222 11 0 95.3 Alignments <----FWR1--><----------CDR1--------><-----------------------FWR2------ W A A S G F T F N T Y A V H W V R Q A P G K G Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 C A A S G F T F S S Y A M H W V R Q A P G K G V 95.7% (221/231) IGHV3-30-3*01 64 ..T......................G..G.......A................................. 133 V 94.8% (221/233) IGHV3-30*01 64 ..T......................G..G.......A................................. 133 ----------------><----------CDR2--------><---------------------------- L E W V A V I S Y D G S N K N Y A D S V K G R F Query_1 97 TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT 166 V 95.3% (222/233) IGHV3-30*04 134 ..................................T......T............................ 203 L E W V A V I S Y D G S N K Y Y A D S V K G R F V 95.7% (221/231) IGHV3-30-3*01 134 .........................................T............................ 203 V 94.8% (221/233) IGHV3-30*01 134 .A................................T......T............................ 203 ---------------------------FWR3--------------------------------------- T I S R D N S K N T L Y L Q M N S L R V E D T Query_1 167 CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG 236 V 95.3% (222/233) IGHV3-30*04 204 ...............................C.G.........................C.......... 273 T I S R D N S K N T L Y L Q M N S L R A E D T V 95.7% (221/231) IGHV3-30-3*01 204 ...............................C.G.........................C.......... 273 V 94.8% (221/233) IGHV3-30*01 204 ...............................C.G.........................C.......... 273 --------------> A V Y Y C T R D M S P I M T T F A G N Y W G Q Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296 A V Y Y C A R V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294 V 94.8% (221/233) IGHV3-30*01 274 .....G.........G.......----------------------------------------------- 296 D 100.0% (7/7) IGHD3-16*01 12 ------------------------------------------.......--------------------- 18 D 100.0% (7/7) IGHD3-16*02 12 ------------------------------------------.......--------------------- 18 D 100.0% (6/6) IGHD1-14*01 8 -------------------------------------------------......--------------- 13 J 100.0% (39/39) IGHJ4*02 10 -------------------------------------------------------............... 24 J 100.0% (35/35) IGHJ5*02 17 -----------------------------------------------------------........... 27 J 97.4% (38/39) IGHJ4*01 10 -------------------------------------------------------.............A. 24 G T L V T V S S Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330 J 100.0% (39/39) IGHJ4*02 25 ........................ 48 J 100.0% (35/35) IGHJ5*02 28 ........................ 51 J 97.4% (38/39) IGHJ4*01 25 ........................ 48 Lambda K H 1.10 0.333 0.549 Gapped Lambda K H 1.08 0.280 0.540 Effective search space used: 64847385 Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1 run=R_2012_04_10_11_57_56_ ...etc... =head2 Example Output Example output from the data above sent: $ ./IGBLAST_simple.pl igBLASTOutput.txt 1 D: Request to process just record '1' received D: printOUTPUTData: Running D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How' OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How D: ID is: 'HL67IUI01D26LR' D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259) D: Starting Search for CDR3 D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence) D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27' D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309) D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +' D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +' D: printOUTPUTData: Running OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH =head4 Usage notes: Designed to be easy to "grep -v D:" or "grep OUTPUT:" for to select the parts you need: ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT: OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01EQMLY In-frame IGHV4-39*01 IGHD2-8*01 IGHJ3*02 GGSISSSSYY 29 IYHSGST 20 CARDATYYSNGFDIWGQG 53 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01CDCLP Out-of-frame IGHV3-23*01 IGHD3-3*01 IGHJ4*02 FSNYAM 16 SGSGDRTY 23 AKAD*FLEWLFRIGDGERLLGPGN 72 MOTIF_FOUND_IN_DNA OUTPUT: HL67IUI01AHRNH N/A IGHV3-33*01 N/A N/A WIHLQ*LW 23 YGMMEVI 23 NOT_FOUND OUTPUT: HL67IUI01DZZ1V Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASG 20 LYCASEGDIVASELLSTGARV 62 MOTIF_FOUND_IN_DNA OUTPUT: HL67IUI01DTR2Y Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*VLGPG 59 MOTIF_FOUND_IN_DNA OUTPUT: HL67IUI01EQL3S In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01AFG46 In-frame IGLV2-34*01 N/A IGHJ4*02 NOT_FOUND OUTPUT: HL67IUI01EFFKO In-frame IGHV3-11*01 IGHD6-6*01 IGHJ4*02 GFTFSDYY 23 ISYSGGTI 23 CARASGAARHRPLDYWGQG 56 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01B18SG In-frame IGHV3-33*01 IGHD5-12*01 IGHJ4*02 VRQA 11 KYYANSVK 23 RLGGFDYWGQGTLVTVSS 53 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01D6LER In-frame IGHV1-24*01 IGHD3-22*01 IGHJ4*02 GYSLNELS 23 PDPEDDE 23 TVQPSRITMMAVVITRIHWGASGARE 76 MOTIF_FOUND_IN_DNA OUTPUT: HL67IUI01CYCLF N/A IGHV4-39*01 N/A N/A GGSISSSSYY 29 IYYSGST 20 NOT_FOUND OUTPUT: HL67IUI01B4LEE In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01A4KW4 Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*IWGQG 58 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01E05BV In-frame IGHV1-24*01 IGHD3-22*01 IGHJ2*01 GYSLNELS 23 PDPEDDE 23 NOT_FOUND OUTPUT: HL67IUI01CVVKY In-frame IGHV1-3*01 IGHD2-15*01 IGHJ1*01 NOT_FOUND OUTPUT: HL67IUI01CN5P2 In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01DUUJ5 In-frame IGHV3-21*01 IGHD1-7*01 IGHJ4*02 GYTFSTYS 23 ISSSSAYR 23 CARDIRLELRDWGQG 44 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01E1AIR Out-of-frame IGHV4-39*01 N/A IGHJ3*01 WGLHRRW**L 29 FVS*RAPR 23 NOT_FOUND OUTPUT: HL67IUI01CCZ8D Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASGR 20 YCASEGDIVASELLSTGARE 58 MOTIF_FOUND_IN_DNA OUTPUT: HL67IUI01BT9IR N/A IGHV3-21*02 N/A N/A NOT_FOUND OUTPUT: HL67IUI01COTO0 Out-of-frame IGHV4-39*01 N/A IGHJ3*01 GGFIGGGDNF 29 LYHDGRPA 23 NOT_FOUND OUTPUT: HL67IUI01D994O In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH OUTPUT: HL67IUI01A08CJ In-frame IGHV4-39*01 IGHD6-13*01 IGHJ5*02 GGSISSSSYY 29 IYYTWEH 21 CERARRGSSWGQLVRPLGPG 62 MOTIF_FOUN OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH ...etc... =head4 Also, combined grep & sed: $ ./IGBLAST_simple.pl igBLASTOutput.txt | grep OUTPUT: | sed 's/OUTPUT:\t//' =cut =head3 CDR3 Patterns: We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly: my $DNACDR3_Pat = "TGGGG....GG."; my $AASequenceMotifPattern = "WG.G"; They are treated as regex's when tested (so use "." to mean any DNA base, rather than 'N' or 'X'). [NB: These are original patterns used for testing, check the code for the current ones.] =cut my $DNACDR3_Pat = "TGGGG....GG."; my $AACDR3_Pat = "WG.G"; use strict; use Data::Dumper; # Set this as to number of the result (aka "record") you want to process or 0 for all: my $ProcessRecord =0; if (defined $ARGV[1]) { $ProcessRecord = pop @ARGV; } #Also accept from the command line: if ($ProcessRecord != 0) { print "D: Request to process just record '$ProcessRecord' received\n"; } #Adjust the record separator: $/="Query= "; my $Record=0; # A simple counter, that we might not use. #Force-loaded header / version information: my $Header = <>; #At the moment we don't use this - so dump it immediately: $Header = undef; #print "D: Force-loaded header / version information: '$Header'\n"; #Print the Header for the output line (we need this once, at the start) print &printOUTPUTData ({"HEADER" => 1})."\n"; while (<>) { =head4 First check - should we be processing this record at all? =cut $Record++; #Increment the record counter: #Do we process this record - or all records? if ($ProcessRecord != $Record && $ProcessRecord != 0) { next; } #We need to increment the record counter before we increment =head4 Setup the output line storage and print the header: We enter this initially and work to change it: $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND"; =cut my %OUTPUT_Data; #To collect data for the output line in #Assume the first and work to find better: $OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND"; #The whole record - one per read - is now stored in $_ my @Lines =split (/[\r\n]+/,$_); # split on windows/linux/mac new lines #If you are interested enable either of the next lines depending on how curious you are as to how the splitting went: #print "D: Record #$Record\n"; print $_; print "\n---------\n"; print "D: ''$Lines[0]'\nD: ...etc...'\nD: ############\n"; =head3 Get the ID Quite easy: the first field on the first line: Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1 =cut (my $ID) = $Lines[0]=~ m/^(\S+)/; unless (defined $ID && $ID ne "") { # So a near total failure...? $OUTPUT_Data{"ID"} = "Unknown"; print &printOUTPUTData (\%OUTPUT_Data)."\n"; next; #No ID is terminal for this record } else { print "D: ID is: '$ID'\n"; $OUTPUT_Data{"ID"} = $ID; } =head3 Declare the variables we will need here in the next few sections to store data =cut my $CurrentRegion; my $RegionMarkup; #So we can sync the coordinated of the alignment up to the domains found: my $Query_Start = -1; my $Query_End = -1; #Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop? my $ThisQueryStart =-1; my $ThisQueryEnd =-1; #Think $ThisQueryEnd isn't used at the moment. my $DNAQuerySequence =""; #The actual DNA Query sequence... my $AAQuerySequence = ""; #As this changes with the alleles identified: my $CurrentAASequence; #The main storage variables my %Alginments; my %Alleles; my %DomainBoundaries; =head2 Stanza 1: Get the general structure of the sequence identified =head3 Method 1: Use the table supplied Technically this valid for the top hit...realistically this is the only information we have reported to us so we use this or nothing. This is fine for the top hit which is likely what we are interested in....but for the 2nd or 3rd? Who knows! Targets this block: Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity) FWR1 167 240 75 72 2 1 96 CDR1 241 264 24 20 4 0 83.3 FWR2 265 315 51 48 3 0 94.1 CDR2 316 336 24 15 6 3 62.5 FWR3 337 450 114 106 8 0 93 CDR3 (V region only) 451 454 4 4 0 0 100 Total N/A N/A 292 265 23 4 90.8 Then we split out the lines inside it in a second scanning step - less optimal but easier to read: FWR1 167 240 75 72 2 1 96 CDR1 241 264 24 20 4 0 83.3 FWR2 265 315 51 48 3 0 94.1 CDR2 316 336 24 15 6 3 62.5 FWR3 337 450 114 106 8 0 93 CDR3 (V region only) 451 454 4 4 0 0 100 into: (Section, from, to, length, matches, mismatches, gaps, percent identity) =head3 Method 2: Use the table supplied The other way to do this is to split the graphical markup out of the alignment. This works for _any_ reported alignment, not just the top hits: In the main alignment table processing section collect the information, collect the information: #Is region mark-up: if ($#InfoColumns == -1 && $#AlignmentColumns ==0) { # print ": Region Markup detected\n"; $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record next; } Then afterwards when all the region was collected, process it like this: #Pad the CDER3 region: #Remove the trailing spaces: $RegionMarkup =~ s/ *$//g; #Calculate the length of the CDR3 region so we can add it in: my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1; #Build up the CDR3 region, the 'x' operator is very helpful here (implict foreach loop): $RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">"; #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n"; #Now really process it: my $C_Pos = 0; my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup); # foreach my $C_Domain (@Domains) { if (length ($C_Domain) <=0) {next;} my $DomainStart= $C_Pos; my $DomainEnd = $DomainStart + length ($C_Domain)-1; my ($DomainType) = $C_Domain =~ m/(...[123])/; # print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n"; $DomainBoundaries{$DomainType}{"Start"} = $DomainStart; $DomainBoundaries{$DomainType}{"End"} = $DomainEnd; $C_Pos = $DomainEnd+1; } The two pieces of code are interchangable; the table version as used below, is neater, easier to understand and works nicely. Why stress? =head3 The end of the FWR3 is the start of CDR3? This is an assumption made. Hence the two variables: my $MaxDomainReported =0 ; # In nts / bps my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially. $MaxDomainBaseFound =cut my $MaxDomainBaseFound =0 ; # In nts / bps my $AlignmentStart ; # In nts /bp #Alternative name would be: '$MinDomainBaseFound'; set to null until primed # my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially. (my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" ); #Enable the next line if you want the raw data we are going to parse in this section: #print Dumper @StructureSummaryTable; foreach my $C_Section (@StructureSummaryTable) { my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section); #print "D: Domain type: '$DomainType'\n"; #$DomainType =~ s/ .*$//g; $DomainBoundaries{$DomainType}{"Start"} = $DomainStart; $DomainBoundaries{$DomainType}{"End"} = $DomainEnd; #So we can do a reality check on the length / start of the CDR3 if we have to go looking: if ($MaxDomainBaseFound <= $DomainEnd) { $MaxDomainBaseFound = $DomainEnd; } #Store the maximum base found if ($AlignmentStart eq undef or $AlignmentStart >= $DomainStart) { $AlignmentStart = $DomainStart; } } #print Dumper %DomainBoundaries; #die "HIT BLOCK\n"; =head3 Did we find the CDR3 region specifically? If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up. =cut print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n"; #my @WantedSections = qw (V D J); =head2 Second Stanza: Parse the main Alignment Table =head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels. As this is a little involved and comparamentalises nicely we sub-contract this to two functions"" (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" ); my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter? =cut (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" ); my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter? #If you are interested, enable this line: # print "D: The info panel was detected at: '$splitPoint'\n"; =head3 =cut foreach my $C_Line (0..$#Table) { =head3 Call the line type we find: There are 4: These are distinguished by the number of fields (one or mores spacer is a field separator) in the Info & Alignment Panels (see values in brackets) | <- This split is ~40 chars. from the start of the line * InfoPanel * | * Alignment Panel * : is a "Blank" line (-1,-1) <----FWR1--><----------CDR1--------><-----------------------FWR2------ : is "Region Markup" (-1,0) W A A S G F T F N T Y A V H W V R Q A P G K G : is "AA Sequence" (-1, >=0) Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 : is "DNA Sequence" (2,1) V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 : is "" " So we split 40 chars in and then the two parts on spaces. =cut # print "D: (sub) Line in parsed table: '$C_Line': \n"; my ($InfoPanel, $AlignmentPanel) = $Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/; my @InfoColumns = split (/\s+/,$InfoPanel); my @AlignmentColumns = split (/\s+/,$AlignmentPanel); #If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first # print "D: Line: $C_Line/t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n"; # print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n"; #Populate this so we can step through it =head4 Is a blank line: =cut if ($#InfoColumns == -1 && $#AlignmentColumns == -1) { # print ": Blank\n"; next; } #For now I think we just skip - is not needed (though might be implict mark-up) =head4 Is region mark-up: =cut if ($#InfoColumns == -1 && $#AlignmentColumns ==0) { # print ": Region Markup detected\n"; $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record next; } =head4 Is query DNA Sequence: =cut if ($#InfoColumns == 2 && $#AlignmentColumns ==1) { # print ": DNA Query Sequence\n"; #Detect the two coordinatates of alignment against the query sequence: (last two numbers of the two 'panels') ($ThisQueryStart) = $InfoPanel =~ / (\d+) *$/; ($ThisQueryEnd) = $AlignmentPanel =~ / (\d+) *$/; my ($ThisDNASeq) = $AlignmentPanel =~ /^(.*?) /; #If you want to know what we just found: #print "D: This DNA Sequence: '$ThisDNASeq'\n"; $DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq; #Add it on to whatever we already have. #Move the needle if there are smaller / greater; otherwise prime the 'needles': if ($ThisQueryStart < $Query_Start or $Query_Start == -1) { $Query_Start = $ThisQueryStart; } if ($ThisQueryEnd > $Query_End or $Query_End == -1) { $Query_End = $ThisQueryEnd; } # print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n"; next; } =head4 Is AA Sequence: This is complicated as it Need to decide whether this is the sequence of the read or that of the original V / D / J regions: --------------> A V Y Y C T R D M S P I M T T F A G N Y W G Q << Want this Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296 A V Y Y C A R V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294 ...etc... G T L V T V S S << Want this Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330 To solve this we peak at the next line that it has the tag "Query" in it (we assume the line exists...) =cut if ($#InfoColumns == -1 && $#AlignmentColumns >=-1) { unless ($Table[$C_Line+1] =~ /Query/) { next; } #Is the next line the DNA sequence ? # # print ": AA sequence\n"; $CurrentAASequence = $AlignmentPanel; #print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n"; $CurrentAASequence =~ s/^ {$PanelSplitPoint}//; #print "D: '$AAQuerySequence'\n"; # print "D: Current AA Sequence: \t'$CurrentAASequence'\n"; $AAQuerySequence = $AAQuerySequence.$CurrentAASequence; #Store the elongating AA Sequence as well next; } =head4 Is Alignment: =cut if ($#InfoColumns == 4 && $#AlignmentColumns ==1) { #Not acutally interesting to us for this version of the parser. Delete ultimately? next; } #Is weird! Don't recognise it! warn "Weird! Don't recongnise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",$Lines[$C_Line],15,"...'\n"; } #End main iteration loop for alignment parsing. =head2 The CDR3 is noted as problematic. Can we identify it? =cut print "D: Starting Search for CDR3\n"; #Do have the end of the FWR3 but not the CDR3? If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point if (exists ($DomainBoundaries{"FWR3"}{"End"}) && $AlignmentStart !=0 && not (exists $DomainBoundaries{"CDR3"}{"End"}) ) #Guess we need to go looking for the end then... { #print "D: Placing call to markUpCDR3\n"; my ($CDR3_Start, my $CDR3_End, my $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence, $DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart, $DNACDR3_Pat, $AACDR3_Pat); if ($CDR3_Start !=0 && $CDR3_End !=0) { $DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start; $DomainBoundaries{"CDR3"}{"End"} = $CDR3_End ; $DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag; print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n"; } else { print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n"; $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND"; } } else { #Was reported by igBLAST print "D: Found the FWR3 from the Domain Boundary Table\n"; $DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE"; } #print Dumper %DomainBoundaries; =head2 Get the top VDJ regions: =cut =head2 Extract General Features: =cut (my $TopHit) = $_ =~ m/V-J Frame, Strand\):\n(.*?)\n/s; print "D: Top Hits (raw)= '$TopHit' \n"; my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit); print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n"; =head2 Store the V / D / J Genes used =cut if (defined $Top_V_gene_match && $Top_V_gene_match ne "") { $OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match; } if (defined $Top_D_gene_match && $Top_D_gene_match ne "") { $OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match; } if (defined $Top_J_gene_match && $Top_J_gene_match ne "") { $OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match; } if (defined $Strand && $Strand ne "") { $OUTPUT_Data{"Strand"} = $Strand;} =head4 Preamble: ID, Frame, and V / D / J used: =cut #Do a reality check: if we didn't get an ID, then skip: unless (defined (defined $ID) && $ID ne "" && defined $VJFrame && $VJFrame ne "") { print &printOUTPUTData (\%OUTPUT_Data)."\n"; next; } #Ok, so we have data...most likely: #print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match); if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "") { $OUTPUT_Data{"VDJ Frame"} = $VJFrame;} else { print &printOUTPUTData (\%OUTPUT_Data)."\n"; next; }#REALLY? We didn't find anything? Oh well, move to next record =head4 CDR1 =cut #Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we substract this off all future AA (& DNA coordinates) my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"}; # print "D: AA Seqeunce is: '$AAQuerySequence'\n"; if (exists $DomainBoundaries{"CDR1"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start { # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience.... my $CDR1Start = $DomainBoundaries{"CDR1"}{"Start"}; my $CDR1End = $DomainBoundaries{"CDR1"}{"End"}; my $CDR1_Length = $CDR1End - $CDR1Start; # print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n"; #Remember that the alignment starts at the FWR1 start, not nt =0 on the read my $CDR1_Seq_AA = substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length); # print "D: '$CDR1_Seq_AA'\n"; $CDR1_Seq_AA =~ s/ //g; my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA); #Add this data to the output store specifically: $OUTPUT_Data{"CDR1 Seq"} = $CDR1_Seq_AA; $OUTPUT_Data{"CDR1 Length"} = $CDR1_Length; } #What happens if there is no CDR1 found? Leave blank - the output routine can handle this =head4 CDR2 =cut if (exists $DomainBoundaries{"CDR2"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start { # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience.... my $CDR2Start = $DomainBoundaries{"CDR2"}{"Start"}; my $CDR2End = $DomainBoundaries{"CDR2"}{"End"}; my $CDR2_Length = $CDR2End - $CDR2Start; my $CDR2_Seq_AA = substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length); $CDR2_Seq_AA =~ s/ //g; my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA); #Add this data to the output store specifically: $OUTPUT_Data{"CDR2 Seq"} = $CDR2_Seq_AA; $OUTPUT_Data{"CDR2 Length"} = $CDR2_Length; } #What happens if there is no CDR2 found? Leave blank - the output routine can handle this. =head4 CDR3 =cut if (exists $DomainBoundaries{"CDR3"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start { # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience.... my $CDR3Start = $DomainBoundaries{"CDR3"}{"Start"}; my $CDR3End = $DomainBoundaries{"CDR3"}{"End"}; my $CDR3_Length = $CDR3End - $CDR3Start; # This variable isn't used - delete it when safe to do so my $CDR3_Seq_AA = substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length); my $CDR3_Seq_DNA = substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length); $CDR3_Seq_AA =~ s/ //g; $CDR3_Seq_DNA =~ s/ //g; my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA); my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA); #Add this data to the output store specifically: $OUTPUT_Data{"CDR3 Seq"} = $CDR3_Seq_AA; $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length; $OUTPUT_Data{"CDR3 Seq DNA"} = $CDR3_Seq_DNA; $OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length; #And in the case of the CDR3 how we found it: $OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"}; } #What happens if there is no CDR3 found? Leave blank - the output routine can handle this. #die "HIT BLOCK\n"; #End of the record; output the data we have collected and move on. print &printOUTPUTData (\%OUTPUT_Data)."\n"; } ############ sub returnLinesBetween { =head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section}) When passed a reference to an array and two strings - interpreted as REGEX's - will return the lines of the Array that are bounded by these tags. If either of the tags are not found - or are found in the wrong order - then a null list is returned. =cut my ($Text_ref, $TopTag, $BotTag) = @_; my @Table; #The two boundary conditions at which we will cut the table: #print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n"; #How we record these: my $AlignmentLine_Top=0; my $AlignmentLine_Bot=0; my $LineIndex=-1; #-1 As the loop increments this line counter first, then does its checks. #If you care: #print "D: Lines of text passed: $$#Lines\n"; #Iterate through until we find what we are looking for or run out of text to search: while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex <=$#{$Text_ref}) { $LineIndex++; #Enable if you need to care: # print "D: Line Index = $LineIndex\n"; if ($$Text_ref[$LineIndex] =~ m/$TopTag/) { $AlignmentLine_Top = $LineIndex; # print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested } if ($$Text_ref[$LineIndex] =~ m/$BotTag/) { $AlignmentLine_Bot = $LineIndex; # print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested } } #Reality check: did we find anything? If not then we return null. if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0) { return; } #Again, enable if you care: #print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n"; #We want the lines one down and one up - so polish these. $AlignmentLine_Top++; $AlignmentLine_Bot--; #Return as an array slice: return (@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]); } ############ sub findSplitPoint { =head2 sub: $PanelBoundaryCahracter = findSplitPoint (\@Table) When passed a table with the alignment in it makes an educated guess as to the precise split point to spearate the 'info' and 'alignment' panels. This is a right olde faff because the field / panel boundaries change. ' Query_6 167 GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG 235' ' Query_6 236 CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG 305' ' Query_6 306 GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC 372' ' Query_6 373 TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT 442' ' Query_6 443 ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC 512' MOTIF_FOUND_IN_AA i.e to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alingment proper) and have 4 fields in total Returns either -1 or the location of the panel boundary, issues a warning and returns -1 if is the most frequent boundary because the pattern match has been failing more often that it suceeded. =cut #A rough guess is 38 for normal sequences, 48 for reversed ones: my $SplitPos = 0; (my $Table_ref) = @_; #Get the reference to the table my @DNALines; #We populate this for mining in the next section foreach my $C_Line (@{$Table_ref}) { #print "D: $C_Line\n"; # (my $SplitLine) = $C_Line; #Split on consecutive tabs or spaces: my @LineFields = split (/[\t\s]+/,$C_Line); #print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n"; unless ( $LineFields[3] =~ m/[^\.]/ && $LineFields[3] =~ m/[ATGCX]{20,}/ && $#LineFields==4) { next; } #Enable if you want to know the lines we think are the DNA Query strings: #print "D: DNA Line: '$C_Line'\n"; push @DNALines, $C_Line; #Note it down } my %PanelBounds; #Will contain the positions of the panel boundaries foreach my $C_DNALine (@DNALines) { #print "D: '$C_DNALine'\n"; $C_DNALine =~ m/[ATGC-]+ \d+$/; #Match the DNA string and the indexingMOTIF_FOUND_IN_AA numbers afterwards, allow gap characters. my $MatchPos = $-[0]; #This is the position of the start of the last match because we can't get the index() function to work #(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0); #print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n"; $PanelBounds{$MatchPos}++; } #Sort the hash values in order and then return the most frequent (will offer some resistance to the occasion pattern failure) #The brackets around "($SplitPos)" are really necessary it seems. ($SplitPos) = (sort { $a <=> $b } keys %PanelBounds); #If you want #print Dumper %PanelBounds; #Tell people if we are having difficultlty: if ($SplitPos == -1) { warn "Couldn't identify the panel boundaries\n"; } #print "D: $SplitPos: Returning the split position of: '$SplitPos'\n"; return $SplitPos; } ## # # ### ##### # # ##### sub markUpCDR3 { =head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex) Tries to identify the end of the CDR3 using the DNA and RNA Sequence patterns MOTIF_FOUND_IN_AAsupplied. The CDR3 is assumed to start at the end of the FWR3. To reduce FP matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern. The position of the first matching pattern is reported. =head4 Fuller Usage: my ($CDR3_Start, my $CDR3_End) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence, $DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"}, $DNACDR3_Pat, $AACDR3_Pat); =head4 Returned Values If the CDR3 was found then we we signal like this: $MotifFound ==0 : Nope, didn't find either motif $MotifFound ==1 : Found at the DNA level, not the AA level $MotifFound ==2 : Found at the the AA level, not the DNA level $MotifFound ==3 : Found at the the AA level & the DNA level (Also remember that if the FWR3 region couldn't be identified in the sequence there is a 4th option: not tested; this routine isn't called therefore) The Start and Ends returned are from the first sucessful match (MotifFound==3): though hopefully they are the same. Formally the test order is: 1) DNA 2) AA i.e. DNA bp locations have priority. Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match). See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation. =head3 Manipulation of AA patternsMOTIF_FOUND_IN_AA Note that patterns are assumed to require white space inserting in them between the letters. This could be a serious limitation =cut #Get the parameters passed: my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat) = @_; print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n"; #Setup our return values: my $Start = 0; my $End =0; my $MotifFound = 0; my $How; #Literally How the motif was found (or not if blank) =head4 Prepare the sequences and the patterns for use Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working. We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on in position calculations. =cut #Because igBLAST doesn't always report from the start of the read (primers and things are upstream): my $CropPoint = $FWR3_End - $FWR1_Start - 1 ; #print "D: markUpCDR3: Crop point is: '$CropPoint'\n"; #print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n"; #We trim off the parts we expect to find the CDR3 motifs in leaving at extra 3nts on to allow for base miss-calling: my $AA_Trimmed = substr ($AA, $CropPoint); my $DNA_Trimmed = substr ($DNA ,$CropPoint); #print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n"; #print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n"; #This lovely hack is to account for the spaces in the AA sequence and we can't get the "$Var =~ m/$AAPat/x" working my $AAPat_Spaced; foreach my $C_Char (0..length($AAPat)-1) #The -1 is because we don't want trailing spaces until the next nt -> AA translation. { $AAPat_Spaced = $AAPat_Spaced.'\s+'.substr ($AAPat,$C_Char,1); } #And write this back into the main pattern we were passed: $AAPat = $AAPat_Spaced; #temp hack: #$AA_Trimmed = $AA; my $MotifFound=0; #So we can record which patterns we found my $MotifPositionDNA =-1; my $MotifPositionAA =-1; #print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n"; =head4 At DNA level: "TGG GGx xxx GGx" [+1] =cut #print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n"; if ($DNA_Trimmed =~ m/$DNAPat/) { $MotifPositionDNA = $+[0]; #Just the easiest way to do this in Perl # print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n"; $MotifFound = $MotifFound + 1; #Any more matches further on? my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA); # print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n"; # print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n"; # print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n"; # print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n"; if ($LaterString =~ m/$DNAPat/) { print "D: markUPCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+ [0], " - which might be worrying\n"; } } =head4 At AA level: "WGxG" [+2] =cut if ($AA_Trimmed=~ m/$AAPat/) { $MotifPositionAA = $+[0]; #Just the easiest way to do this in Perl $MotifFound = $MotifFound + 2; # print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n"; (my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA); # print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n"; } =head4 Assess the results of motif position finding =cut #print "D: markUpCDR3: MotifFound = '$MotifFound'\n"; if ($MotifFound ==0) { return ($Start, $End, $MotifFound); } #The easy one really: return we didn't find the CDR3 # $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3: #Just found in DNA: if ($MotifFound ==1) { $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3: $End = $MotifPositionDNA; $How = "MOTIF_FOUND_IN_DNA"; } #Just found in AA: if ($MotifFound ==2) { $End = $MotifPositionAA; $How = "MOTIF_FOUND_IN_AA"; } #Found in both, DNA has priority: if ($MotifFound ==3) { $Start = $FWR3_End ; #We assume the end of the FWR3 is the start of CDR3: $End = $MotifPositionDNA; $How = "MOTIF_FOUND_IN_BOTH"; } #print "D: markUpCDR3: Motif found = $MotifFound\n"; =head4 These next few lines are for testing / diagnostics only - disable for general use If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that the start of FWR1 is unlikely to be at nt 1. =cut $Start = $FWR3_End - $FWR1_Start -1; $End = $End + $CropPoint; my $CDR3_RegionLength = $End - $Start; #print "D: markUpCDR3: CDR3 Length= $Start - $End = '$CDR3_RegionLength'\n"; (my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength); #Add onto the coordinates what we trimmed off: #print "D: markUpCDR3: Seq ='$CDR3_seq'\n"; print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of :'+ $FWR1_Start'\n"; #die "HIT BLOCK\n"; return ($Start + $FWR1_Start, $End + $FWR1_Start, $How); } sub printOUTPUTData { =head2 sub: $OutputDataString = printOUTPUTData {\%OutputData} When passed an array containing the appropriate CDR, Top V / D/ J genes and the seqeunce ID. This prepared and then returned as a text string that can then be printed to STDOUT: print (printOUTPUTData (\%OutputData)); Any missing data in the Hash array it polietly ignored and a null string printed in place. The text field is tab delimited; there are no extra trailing tabs or carriage returns in place. Actually the fields printed out are stored in an index array. =head3 Header output If the routine is passed a key 'HEADER' then the header columns are returned as that string. This is tested first - so don't add this unless you mean to. =cut my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene", "CDR1 Seq", "CDR1 Length", "CDR2 Seq", "CDR2 Length", "CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand", "CDR3 Found How"); my $OutputString = "OUTPUT:"; #What we are going to build the output into. =head4 Print Header & Exit? =cut my ($Data_ref) = @_; #print "D: printOUTPUTData: Running\n"; if (exists $$Data_ref {"HEADER"}) { $OutputString .= "\t"; for(my $n = 0; $n <= $#HeaderFields; $n++) { $OutputString .= $HeaderFields[$n]; $OutputString .= "\t" if($n < $#HeaderFields); } # foreach my $C_Header (@HeaderFields) # { $OutputString .= "$C_Header"; } # print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n"; return ($OutputString); } =head3 Assemble whatever data we have - and tab delimit the null fields =cut #print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref; foreach my $C_Header (@HeaderFields) { if (exists ($$Data_ref {$C_Header})) { $OutputString .= "\t". $$Data_ref{$C_Header}; } #We have data to print out else { $OutputString .="\t"; } #Add a trailing space } # return ($OutputString); } ######################################### Code Junk ######################## =head2 Code Junk Attic =head3 Demonstrates how to reverse translate an amino acid sequence into DNA: use Bio::Tools::CodonTable; use Bio::Seq; # print possible codon tables my $tables = Bio::Tools::CodonTable->tables; while ( (my $id, my $name) = each %{$tables} ) { print "$id = $name\n"; } my $CodonTable = Bio::Tools::CodonTable->new(); my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n"; my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq); print "D: '$rvSeq'\n"; die "TEST OVER\n"; =cut =head3 For processing the 'Alignment lines' section of the alginment table #If we are ever interested; then enable the code below: # print ": Alignment\n"; # $InfoPanel =~ s/^ +//; $InfoPanel =~ s/ +$//; #Clean off trailing spaces # my ($Germclass, $PID, $PID_Counts, $Allele) = split (/\s+/,$InfoPanel); #Split on spaces ##Enable if you need to know what we just found: # #print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n"; # #A reality check: we should have an Allele - or some text here. # unless (defined $Allele && $Allele ne "") # { warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n"; } # if (exists ($Alginments {$Germclass}{$Allele})) # { $Alginments {$Germclass}{$Allele} = $Alginments {$Germclass}{$Allele}.$CurrentAASequence; } #Carry on adding # else #more work needed as we need to 'pad' the sequence with fake gap characters) # { ##Do we still need this padding? I don't think so # # # my $PaddingChars = ($ThisQueryStart-$Query_Start); # print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n"; # #To help testing, calculate this first: # my $PaddingString = " "x $PaddingChars; # $Alginments {$Germclass}{$Allele} = $CurrentAASequence; # } # next =head3 Demonstration of Pattern match positions my $Text = "12345TTT TTAAAAA"; my $TestPat = "TTT\\s+TT"; (my $Result)= $Text =~ m/$TestPat/; print "D: Two vars are: - = ",$-[0], " & + =", $+[0]," for test pattern '$TestPat'\n"; sub printCDR3 { =head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence); ???? IS THIS FUNCTION IN USE ????? Handles the printing of the output when passed information about the CDR3 region. The result is sent returned as a text string in this version hence use it like this if you want to send it to STDOUT: print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n"; #=cut #Despite the similarity in names, these are all local copies passed to us: my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_; #For DNA: my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End); my ($CDR_DNA_Length) = length ($CDR_DNA_Seq); #For AA: my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End); my ($CDR_AA_Length) = length ($CDR_AA_Seq); my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed: print "D: SUB: printCDR3: As returned: '$ReturnString'\n"; return ($ReturnString); } =cut =head2 Change Log =head3 Version 1.2 1) Fixed the 'Process recrod request' feature' [was failed increment in $Record] 2) Deleted / Deactivated the function 'printCDR3' [wasn't in used; kept if useful for parts]. This function is replaced by the more general printOUTPUTData() 3) A tag for the CDR3 status is now output for every record / read. Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found. =head4 Version 1.3 1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”, which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly. - found by Bas Horsman. =head4 Version 1.3a 1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to addition error to the $MotifFound var (never could == 3) =cut =head4 Version 1.4 1) Now processes files using Mac/Unix/MS-DOS newline characters: $_ =~ s/\r\n/\n/g; #In case line ends are MS-DOS $_ =~ s/\r/\n/g; #In case line ends are Mac #The whole record - one per read - is now stored in $_ my @Lines =split (/\R/,$_); #Split on new lines =head4 Version 1.4a 1) Fixed the length of the CDR3 AA string being reported correctly: $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length; to: $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;