", file=main.html, append=T)
if(empty.region.filter == "leader"){
cat("FR1+CDR1+FR2+CDR2+FR3+CDR3 sequences that show up more than once", file=main.html, append=T)
@@ -305,7 +306,11 @@
names(NTresult) = c(tmp, paste(clazz, c("x", "y", "z"), sep=""))
}
-write.table(NToverview[,c("Sequence.ID", "best_match", "seq", "A", "C", "G", "T")], NToverview.file, quote=F, sep="\t", row.names=F, col.names=T)
+NToverview.tmp = NToverview[,c("Sequence.ID", "best_match", "seq", "A", "C", "G", "T")]
+
+names(NToverview.tmp) = c("Sequence.ID", "best_match", "Sequence of the analysed region", "A", "C", "G", "T")
+
+write.table(NToverview.tmp, NToverview.file, quote=F, sep="\t", row.names=F, col.names=T)
NToverview = NToverview[!grepl("unmatched", NToverview$best_match),]
diff -r 05c62efdc393 -r a24f8c93583a shm_clonality.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_clonality.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,144 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
References
+
+
Gupta,
+Namita T. and Vander Heiden, Jason A. and Uduman, Mohamed and Gadala-Maria,
+Daniel and Yaari, Gur and Kleinstein, Steven H. (2015). Change-O: a toolkit for analyzing large-scale B cell
+immunoglobulin repertoire sequencing data. In Bioinformatics, 31 (20), pp.
+3356-3358. [doi:10.1093/bioinformatics/btv359][Link]
+
+
+
+
All, IGA, IGG, IGM and IGE tabs
+
+
In
+these tabs information on the clonal relation of transcripts can be found. To
+calculate clonal relation Change-O is used (Gupta et al, PMID: 26069265).
+Transcripts are considered clonally related if they have at most three nucleotide
+differences in their CDR3 sequence and the same first V segment (as assigned by
+IMGT). Results are represented in a table format showing the clone size and the
+number of clones or sequences with this clone size. Change-O settings used are
+the nucleotide Hamming distance substitution model with
+a complete distance of at most three. For clonal assignment the first gene
+segments were used, and the distances were not normalized. In case of
+asymmetric distances, the minimal distance was used.
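
The pairwise criterion described above can be illustrated with a short R sketch. This is not the Change-O implementation: the function names `hamming_distance` and `clonally_related` are hypothetical, only the pairwise threshold is shown (not the grouping of sequences into clones), and requiring both CDR3 sequences to have the same length is an assumption made here so that the Hamming distance is defined.

```r
# Hypothetical helpers for illustration only; not part of Change-O or the pipeline.

# Number of positions at which two equal-length sequences differ.
hamming_distance <- function(a, b) {
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

# TRUE when two transcripts share the same first V segment and their CDR3
# sequences differ by at most max_dist nucleotides (not normalized).
clonally_related <- function(cdr3_a, v_a, cdr3_b, v_b, max_dist = 3) {
  # Assumption: a different V segment or CDR3 length rules out a clonal relation.
  if (v_a != v_b || nchar(cdr3_a) != nchar(cdr3_b)) return(FALSE)
  hamming_distance(cdr3_a, cdr3_b) <= max_dist
}

# Made-up CDR3 sequences that differ at two positions.
clonally_related("TGTGCGAGAGAT", "IGHV3-23", "TGTGCGAAAGAC", "IGHV3-23")  # TRUE
# Identical CDR3 but a different V segment.
clonally_related("TGTGCGAGAGAT", "IGHV1-2", "TGTGCGAGAGAT", "IGHV3-23")   # FALSE
```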
+
+
+
+
Overlap
+tab
+
+
This
+tab shows in which (sub)class(es) each unique analyzed region
+(based on the exact nucleotide sequence of the analyzed region and the CDR3
+nucleotide sequence) is found. This indicates whether the combination
+of the exact same nucleotide sequence of the analyzed region and the CDR3
+sequence can be found in multiple (sub)classes.
+
+
Please note that this tab is based on all
+sequences before the filter unique sequences and the remove duplicates based on
+filters are applied. In this table only sequences occurring more than once are
+included.
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_csr.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_csr.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,95 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
The
+graphs in this tab give insight into the subclass distribution of IGG and IGA
+transcripts. Human Cµ, Cα, Cγ and Cε
+constant genes are assigned using a custom script
+specifically designed for human (sub)class assignment in repertoire data as
+described in van Schouwenburg and IJspeert et al, submitted for publication. In
+this script the reference sequences for the subclasses are divided into 8-nucleotide
+chunks which overlap by 4 nucleotides. These overlapping chunks are
+then individually aligned in the right order to each input sequence. The
+percentage of the chunks identified in each rearrangement is reported as the
+chunk hit percentage. Cα and Cγ
+subclasses are very homologous and only differ in a few nucleotides. To assign
+subclasses the nt hit percentage is calculated.
+This percentage indicates how well the chunks covering the subclass specific
+nucleotides match with the different subclasses. Information
+on the normal distribution of subclasses in healthy individuals of different ages
+can be found in IJspeert and van Schouwenburg et al, PMID: 27799928.
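
The chunking step described above can be sketched in R as follows. This is a simplified illustration and not the custom script itself: `make_chunks` and `chunk_hit_percentage` are hypothetical names, the reference sequence is made up, and exact substring matching is used here as a stand-in for the alignment the script performs.

```r
# Hypothetical helpers for illustration only.

# Split a reference into chunks of `size` nt that overlap by `overlap` nt,
# so consecutive chunks start (size - overlap) positions apart.
make_chunks <- function(reference, size = 8, overlap = 4) {
  step <- size - overlap
  last_start <- nchar(reference) - size + 1
  if (last_start < 1) return(character(0))
  starts <- seq(1, last_start, by = step)
  substring(reference, starts, starts + size - 1)
}

# Percentage of chunks found in the sequence, in reference order.
# Assumption: a chunk counts as a hit only on an exact match.
chunk_hit_percentage <- function(sequence, chunks) {
  pos <- 1    # search position; it only moves forward, which enforces the order
  hits <- 0
  for (chunk in chunks) {
    found <- as.integer(regexpr(chunk, substring(sequence, pos), fixed = TRUE))
    if (found > 0) {
      hits <- hits + 1
      pos <- pos + found   # the next chunk must start after this chunk's start
    }
  }
  100 * hits / length(chunks)
}

reference <- "ACGTACGTTTGGCCAA"   # made-up 16-nt example, not a real constant gene
chunks <- make_chunks(reference)  # "ACGTACGT" "ACGTTTGG" "TTGGCCAA"
chunk_hit_percentage(reference, chunks)            # 100
chunk_hit_percentage("ACGTACGTTTGGCCAT", chunks)   # about 66.7 (last chunk misses)
```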
+
+
IGA
+subclass distribution
+
+
Pie
+chart showing the relative distribution of IGA1 and IGA2 transcripts in the
+sample.
+
+
IGG
+subclass distribution
+
+
Pie
+chart showing the relative distribution of IGG1, IGG2, IGG3 and IGG4
+transcripts in the sample.
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_csr.r
--- a/shm_csr.r Tue Dec 20 09:03:15 2016 -0500
+++ b/shm_csr.r Thu Dec 22 09:39:27 2016 -0500
@@ -302,14 +302,14 @@
print("Plotting heatmap and transition")
png(filename=paste("transitions_stacked_", name, ".png", sep=""))
p = ggplot(transition2, aes(factor(reorder(id, order.x)), y=value, fill=factor(reorder(variable, order.y)))) + geom_bar(position="fill", stat="identity", colour="black") #stacked bar
- p = p + xlab("From base") + ylab("") + ggtitle("Mutations frequency from base to base") + guides(fill=guide_legend(title=NULL))
+ p = p + xlab("From base") + ylab("") + ggtitle("Bargraph transition information") + guides(fill=guide_legend(title=NULL))
p = p + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=16, colour="black")) + scale_fill_manual(values=c("A" = "blue4", "G" = "lightblue1", "C" = "olivedrab3", "T" = "olivedrab4"))
#p = p + scale_colour_manual(values=c("A" = "black", "G" = "black", "C" = "black", "T" = "black"))
print(p)
dev.off()
png(filename=paste("transitions_heatmap_", name, ".png", sep=""))
p = ggplot(transition2, aes(factor(reorder(variable, -order.y)), factor(reorder(id, -order.x)))) + geom_tile(aes(fill = value)) + scale_fill_gradient(low="white", high="steelblue") #heatmap
- p = p + xlab("To base") + ylab("From Base") + ggtitle("Mutations frequency from base to base") + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=13, colour="black"))
+ p = p + xlab("To base") + ylab("From Base") + ggtitle("Heatmap transition information") + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=16, colour="black"))
print(p)
dev.off()
} else {
@@ -388,7 +388,7 @@
pc = pc + geom_bar(width = 1, stat = "identity") + scale_fill_manual(labels=genesForPlot$label, values=c("IGA1" = "lightblue1", "IGA2" = "blue4"))
pc = pc + coord_polar(theta="y") + scale_y_continuous(breaks=NULL)
pc = pc + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=16, colour="black"), axis.title=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
- pc = pc + xlab(" ") + ylab(" ") + ggtitle(paste("IGA subclasses", "( n =", sum(genesForPlot$Freq), ")"))
+ pc = pc + xlab(" ") + ylab(" ") + ggtitle(paste("IGA subclass distribution", "( n =", sum(genesForPlot$Freq), ")"))
write.table(genesForPlot, "IGA_pie.txt", sep="\t",quote=F,row.names=F,col.names=T)
png(filename="IGA.png")
@@ -409,7 +409,7 @@
pc = pc + geom_bar(width = 1, stat = "identity") + scale_fill_manual(labels=genesForPlot$label, values=c("IGG1" = "olivedrab3", "IGG2" = "red", "IGG3" = "gold", "IGG4" = "darkred"))
pc = pc + coord_polar(theta="y") + scale_y_continuous(breaks=NULL)
pc = pc + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=16, colour="black"), axis.title=element_blank(), axis.text=element_blank(), axis.ticks=element_blank())
- pc = pc + xlab(" ") + ylab(" ") + ggtitle(paste("IGG subclasses", "( n =", sum(genesForPlot$Freq), ")"))
+ pc = pc + xlab(" ") + ylab(" ") + ggtitle(paste("IGG subclass distribution", "( n =", sum(genesForPlot$Freq), ")"))
write.table(genesForPlot, "IGG_pie.txt", sep="\t",quote=F,row.names=F,col.names=T)
png(filename="IGG.png")
@@ -430,7 +430,7 @@
p = p + geom_point(aes(colour=best_match), position="jitter") + geom_boxplot(aes(middle=mean(percentage_mutations)), alpha=0.1, outlier.shape = NA)
p = p + xlab("Subclass") + ylab("Frequency") + ggtitle("Frequency scatter plot") + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=16, colour="black"))
p = p + scale_fill_manual(values=c("IGA" = "blue4", "IGA1" = "lightblue1", "IGA2" = "blue4", "IGG" = "olivedrab3", "IGG1" = "olivedrab3", "IGG2" = "red", "IGG3" = "gold", "IGG4" = "darkred", "IGM" = "darkviolet", "IGE" = "darkorange", "all" = "blue4"))
-p = p + scale_colour_manual(values=c("IGA" = "blue4", "IGA1" = "lightblue1", "IGA2" = "blue4", "IGG" = "olivedrab3", "IGG1" = "olivedrab3", "IGG2" = "red", "IGG3" = "gold", "IGG4" = "darkred", "IGM" = "darkviolet", "IGE" = "darkorange", "all" = "blue4"))
+p = p + scale_colour_manual(guide = guide_legend(title = "Subclass"), values=c("IGA" = "blue4", "IGA1" = "lightblue1", "IGA2" = "blue4", "IGG" = "olivedrab3", "IGG1" = "olivedrab3", "IGG2" = "red", "IGG3" = "gold", "IGG4" = "darkred", "IGM" = "darkviolet", "IGE" = "darkorange", "all" = "blue4"))
png(filename="scatter.png")
print(p)
@@ -454,7 +454,7 @@
p = ggplot(frequency_bins_data, aes(frequency_bins, frequency))
p = p + geom_bar(aes(fill=best_match_class), stat="identity", position="dodge") + theme(panel.background = element_rect(fill = "white", colour="black"), text = element_text(size=16, colour="black"))
-p = p + xlab("Frequency ranges") + ylab("Frequency") + ggtitle("Mutation Frequencies by class") + scale_fill_manual(values=c("IGA" = "blue4", "IGG" = "olivedrab3", "IGM" = "darkviolet", "IGE" = "darkorange", "all" = "blue4"))
+p = p + xlab("Frequency ranges") + ylab("Frequency") + ggtitle("Mutation Frequencies by class") + scale_fill_manual(guide = guide_legend(title = "Class"), values=c("IGA" = "blue4", "IGG" = "olivedrab3", "IGM" = "darkviolet", "IGE" = "darkorange", "all" = "blue4"))
png(filename="frequency_ranges.png")
print(p)
diff -r 05c62efdc393 -r a24f8c93583a shm_csr.xml
--- a/shm_csr.xml Tue Dec 20 09:03:15 2016 -0500
+++ b/shm_csr.xml Thu Dec 22 09:39:27 2016 -0500
@@ -96,11 +96,11 @@
**Input files**
-IMGT/HighV-QUEST .zip and .txz are accepted as input files.
+IMGT/HighV-QUEST .zip and .txz are accepted as input files. The file to be analysed can be selected using the dropdown menu.
.. class:: infomark
-Note: Files can be uploaded by using “get data” and “upload file” and selecting “IMGT archive” as a file type.
+Note: Files can be uploaded by using “get data” and “upload file” and selecting “IMGT archive” as a file type. Special characters should be avoided in the file names of the uploaded samples as these can give errors when running the immune repertoire pipeline. Underscores are allowed in the file names.
-----
@@ -108,15 +108,15 @@
Identifies the region which will be included in the analysis (analysed region)
-- Sequences which are missing a gene region (FR1/CDR1 etc) in the analysed region are excluded
-- Sequences containing an ambiguous base in the analysed region are excluded
-- All other filtering/analysis is based on the analysed region
+- Sequences which are missing a gene region (FR1/CDR1 etc) in the analysed region are excluded.
+- Sequences containing an ambiguous base in the analysed region or the CDR3 are excluded.
+- All other filtering/analysis is based on the analysed region.
-----
**Functionality filter**
-Allows filtering on productive rearrangement, unproductive rearrangements or both based on the assignment provided by IMGT.
+Allows filtering on productive rearrangements, unproductive rearrangements or both based on the assignment provided by IMGT.
**Filter unique sequences**
@@ -125,13 +125,13 @@
This filter consists of two different steps.
-Step 1: removes all sequences of which the nucleotide sequence in the “analysed region” (see sequence starts at filter) occurs only once. (Sub)classes are not taken into account in this filter step.
+Step 1: removes all sequences of which the nucleotide sequence in the “analysed region” and the CDR3 (see sequence starts at filter) occurs only once. (Sub)classes are not taken into account in this filter step.
-Step 2: removes all duplicate sequences (sequences with the exact same nucleotide sequence in the analysed region and the same (sub)class).
+Step 2: removes all duplicate sequences (sequences with the exact same nucleotide sequence in the analysed region, the CDR3 and the same (sub)class).
.. class:: infomark
-Note: This means that sequences with the same nucleotide sequence but a different (sub)class will be included in the results of both (sub)classes.
+This means that sequences with the same nucleotide sequence but a different (sub)class will be included in the results of both (sub)classes.
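
The two filter steps can be sketched in R as shown here. This is a minimal illustration under stated assumptions, not the pipeline's own code: the function name `filter_unique` and the column names `region`, `cdr3` and `best_match` are hypothetical.

```r
# Hypothetical helper and column names, for illustration only.
filter_unique <- function(df) {
  # Step 1: drop sequences whose analysed region + CDR3 occurs only once,
  # ignoring the (sub)class.
  key <- paste(df$region, df$cdr3)
  df <- df[key %in% key[duplicated(key)], ]
  # Step 2: drop duplicates that share analysed region, CDR3 and (sub)class.
  df[!duplicated(df[, c("region", "cdr3", "best_match")]), ]
}

sequences <- data.frame(
  region     = c("AAA", "AAA", "AAA", "CCC"),
  cdr3       = c("TGT", "TGT", "TGT", "TGT"),
  best_match = c("IGA1", "IGA1", "IGG1", "IGM"),
  stringsAsFactors = FALSE
)
result <- filter_unique(sequences)
result$best_match   # "IGA1" "IGG1"
```

In this made-up example the sequence that occurs only once (IGM) is removed in step 1, and the repeated sequence is kept once for IGA1 and once for IGG1.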
*Keep unique:*
@@ -167,7 +167,7 @@
.. class:: infomark
-Note: The first sequence (in the data set) of each clone is always included in the analysis. When the first matched sequence is unmatched (no subclass assigned) the first matched sequence will be included. This means that altering the data order (by for instance sorting) can change the sequence which is included in the analysis and therefore slightly influence results.
+Note: The first sequence (in the data set) of each clone is always included in the analysis. When the first sequence is unmatched (no subclass assigned) the first matched sequence will be included. This means that altering the data order (for instance by sorting) can change the sequence which is included in the analysis and therefore slightly influence the results.
-----
@@ -175,21 +175,27 @@
.. class:: warningmark
-Note: This filter should only be applied when analysing human IGH data in which a (sub)class specific sequence is present. Otherwise please select the "do not assign (sub)class" option to prevent errors when running the pipeline.
+Note: This filter should only be applied when analysing human IGH data in which a (sub)class specific sequence is present. Otherwise please select the do not assign (sub)class option to prevent errors when running the pipeline.
The class percentage is based on the “chunk hit percentage” (see below). The subclass percentage is based on the “nt hit percentage” (see below).
The SHM & CSR pipeline identifies human Cµ, Cα, Cγ and Cε constant genes by dividing the reference sequences for the subclasses (NG_001019) in 8 nucleotide chunks which overlap by 4 nucleotides. These overlapping chunks are then individually aligned in the right order to each input sequence. This alignment is used to calculate the chunk hit percentage and the nt hit percentage.
-*Chunk hit percentage*: the percentage of the chunks that is aligned
+*Chunk hit percentage*: The percentage of the chunks that is aligned
-*Nt hit percentage*: The percentage of chunks covering the subclass specific nucleotide match with the different subclasses. The most stringent filter for the subclass is 70% “nt hit percentage” which means that 5 out of 7 subclass specific nucleotides for Cα or 6 out of 8 subclass specific nucleotides of Cγ should match with the specific subclass.
+*Nt hit percentage*: Indicates how well the chunks covering the subclass specific nucleotides match with the different subclasses. The most stringent filter for the subclass is 70% “nt hit percentage” which means that 5 out of 7 subclass specific nucleotides for Cα or 6 out of 8 subclass specific nucleotides of Cγ should match with the specific subclass.
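
The arithmetic behind the 70% threshold can be checked with a few lines of R. The helper names `nt_hit_percentage` and `passes_threshold` are hypothetical, and treating the threshold as "greater than or equal to" is an assumption; the counts of subclass specific nucleotides (7 for Cα, 8 for Cγ) are taken from the text above.

```r
# Hypothetical helpers for illustration only.
nt_hit_percentage <- function(matching, total) 100 * matching / total

# Assumption: a sequence passes when its percentage is at least the threshold.
passes_threshold <- function(matching, total, threshold = 70) {
  nt_hit_percentage(matching, total) >= threshold
}

passes_threshold(5, 7)   # Cα, 5 of 7 = 71.4%: TRUE
passes_threshold(4, 7)   # Cα, 4 of 7 = 57.1%: FALSE
passes_threshold(6, 8)   # Cγ, 6 of 8 = 75.0%: TRUE
passes_threshold(5, 8)   # Cγ, 5 of 8 = 62.5%: FALSE
```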
-----
**Output new IMGT archives per class into your history?**
-If yes is selected, additional output files (one for each class) will be added to the history which contain information of the sequences that passed the selected filtering criteria. These files are in the same format as the IMGT/HighV-QUEST output files and therefore are also compatible with many other analysis programs, such as IGGalaxy.
+If yes is selected, additional output files (one for each class) will be added to the history which contain information of the sequences that passed the selected filtering criteria. These files are in the same format as the IMGT/HighV-QUEST output files and therefore are also compatible with many other analysis programs, such as the Immune repertoire pipeline.
+
+-----
+
+**Execute**
+
+Upon pressing execute a new analysis is added to your history (right side of the page). Initially this analysis will be grey; once the analysis starts running, its colour in the history will change to yellow. When the analysis is finished it will turn green in the history. The analysis can then be opened by clicking on the eye icon of the analysis of interest. When an analysis turns red an error has occurred while running the analysis. If you click on the analysis title additional information on the analysis can be found. In addition a bug icon appears, where more information on the error can be found.
]]>
diff -r 05c62efdc393 -r a24f8c93583a shm_downloads.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_downloads.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,538 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Info
+
+
The complete
+dataset:
+Allows downloading of the complete parsed data set.
+
+
The filtered
+dataset:
+Allows downloading of all parsed IMGT information of all transcripts that
+passed the chosen filter settings.
+
+
The alignment
+info on the unmatched sequences: Provides information on the subclass
+alignment of all unmatched sequences. For each sequence the chunk hit
+percentage and the nt hit percentage are shown together with the best matched
+subclass.
+
+
SHM Overview
+
+
The SHM Overview
+table as a dataset: Allows downloading of the SHM Overview
+table as a data set.
+
+
Motif data per
+sequence ID: Provides a file that contains information for each
+transcript on the number of mutations present in WA/TW and RGYW/WRCY motifs.
+
+
Mutation data
+per sequence ID: Provides a file containing information
+on the number of sequenced bases, the number and location of mutations and the
+type of mutations found in each transcript.
+
+
Base count for
+every sequence: links to a page showing for each transcript the
+sequence of the analysed region (as dependent on the sequence starts at filter),
+the assigned subclass and the number of sequenced A, C, G and T bases.
+
+
The data used to
+generate the percentage of mutations in AID and pol eta motives plot:
+Provides a file containing the values used to generate the percentage of
+mutations in AID and pol eta motives plot in the SHM overview tab.
+
+
The
+data used to generate the relative mutation patterns plot:
+Provides a download with the data used to generate the relative mutation
+patterns plot in the SHM overview tab.
+
+
The
+data used to generate the absolute mutation patterns plot:
+Provides a download with the data used to generate the absolute mutation
+patterns plot in the SHM overview tab.
+
+
SHM Frequency
+
+
The data used to
+generate the frequency scatter plot: Allows
+downloading the data used to generate the frequency scatter plot in the SHM
+frequency tab.
+
+
The data used to
+generate the frequency by class plot: Allows
+downloading the data used to generate the frequency by class plot included in the
+SHM frequency tab.
+
+
The data for
+frequency by subclass: Provides information on the number and
+percentage of sequences that have 0%, 0-2%, 2-5%, 5-10%, 10-15%, 15-20%,
+>20% SHM. Information is provided for each subclass.
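
The binning into SHM frequency ranges can be sketched with `cut()` in R. This is an illustration under stated assumptions, not the code in shm_csr.r: the helper name `bin_shm` and the example percentages are made up, and the choice to put a boundary value such as exactly 2% into the lower bin is an assumption.

```r
# Hypothetical helper for illustration only.
# Assign each SHM percentage to one of the ranges named in the text.
bin_shm <- function(pct) {
  cut(pct,
      breaks = c(-Inf, 0, 2, 5, 10, 15, 20, Inf),
      labels = c("0", "0-2", "2-5", "5-10", "10-15", "15-20", ">20"),
      right = TRUE)   # intervals are (lower, upper], so exactly 0 lands in "0"
}

shm_percentages <- c(0, 0, 1.5, 3.2, 7.8, 12, 25)   # made-up values
counts <- table(bin_shm(shm_percentages))
counts                                  # number of sequences per range
round(100 * counts / sum(counts), 1)    # percentage of sequences per range
```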
+
+
+
+
Transition
+Tables
+
+
The data for the
+'all' transition plot: Contains the information used to
+generate the transition table for all sequences.
+
+
The data for the
+'IGA' transition plot: Contains the information used to
+generate the transition table for all IGA sequences.
+
+
The data for the
+'IGA1' transition plot: Contains the information used to
+generate the transition table for all IGA1 sequences.
+
+
The data for the
+'IGA2' transition plot: Contains the information used to
+generate the transition table for all IGA2 sequences.
+
+
The data for the
+'IGG' transition plot: Contains the information used to
+generate the transition table for all IGG sequences.
+
+
The data for the
+'IGG1' transition plot: Contains the information used to
+generate the transition table for all IGG1 sequences.
+
+
The data for the
+'IGG2' transition plot: Contains the information used to
+generate the transition table for all IGG2 sequences.
+
+
The data for the
+'IGG3' transition plot: Contains the information used to
+generate the transition table for all IGG3 sequences.
+
+
The data for the
+'IGG4' transition plot: Contains the information used to
+generate the transition table for all IGG4 sequences.
+
+
The data for the
+'IGM' transition plot: Contains the information used to
+generate the transition table for all IGM sequences.
+
+
The data for the
+'IGE' transition plot: Contains the
+information used to generate the transition table for all IGE sequences.
+
+
Antigen
+selection
+
+
AA mutation data
+per sequence ID: Provides for each transcript information on whether
+there is a replacement mutation at each amino acid location (as defined by IMGT).
+For all amino acids outside of the analysed region the value 0 is given.
+
+
Presence of AA
+per sequence ID: Provides for each transcript information on which
+amino acid location (as defined by IMGT) is present. 0 is absent, 1
+is present.
+
+
The data used to
+generate the aa mutation frequency plot: Provides the
+data used to generate the aa mutation frequency plot for all sequences in the
+antigen selection tab.
+
+
The data used to
+generate the aa mutation frequency plot for IGA: Provides the
+data used to generate the aa mutation frequency plot for all IGA sequences in
+the antigen selection tab.
+
+
The data used to
+generate the aa mutation frequency plot for IGG: Provides the
+data used to generate the aa mutation frequency plot for all IGG sequences in
+the antigen selection tab.
+
+
The data used to
+generate the aa mutation frequency plot for IGM: Provides the
+data used to generate the aa mutation frequency plot for all IGM sequences in
+the antigen selection tab.
+
+
The data used to
+generate the aa mutation frequency plot for IGE: Provides the
+data used to generate the aa mutation frequency plot for all IGE sequences in
+the antigen selection tab.
+
+
Baseline PDF (http://selection.med.yale.edu/baseline/): PDF
+containing the Antigen selection (BASELINe) graph for all
+sequences.
+
+
Baseline data:
+Table output of the BASELINe analysis. Calculations of antigen selection as
+performed by BASELINe are shown for each individual sequence and the sum of all
+sequences.
+
+
Baseline IGA
+PDF:
+PDF containing the Antigen selection (BASELINe) graph for all IGA
+sequences.
+
+
Baseline IGA
+data:
+Table output of the BASELINe analysis. Calculations of antigen selection as
+performed by BASELINe are shown for each individual IGA sequence and the sum of
+all IGA sequences.
+
+
Baseline IGG
+PDF:
+PDF containing the Antigen selection (BASELINe) graph for all IGG
+sequences.
+
+
Baseline IGG
+data:
+Table output of the BASELINe analysis. Calculations of antigen selection as
+performed by BASELINe are shown for each individual IGG sequence and the sum of
+all IGG sequences.
+
+
Baseline IGM PDF: PDF
+containing the Antigen selection (BASELINe) graph for all IGM
+sequences.
+
+
Baseline IGM
+data:
+Table output of the BASELINe analysis. Calculations of antigen selection as
+performed by BASELINe are shown for each individual IGM sequence and the sum of
+all IGM sequences.
+
+
Baseline IGE
+PDF:
+PDF containing the Antigen selection (BASELINe) graph for all IGE
+sequences.
+
+
+
Baseline IGE
+data:
+Table output of the BASELINe analysis. Calculations of antigen selection as
+performed by BASELINe are shown for each individual IGE sequence and the sum of
+all IGE sequences.
+
+
CSR
+
+
The data for the
+IGA
+subclass distribution plot: Data used for
+the generation of the IGA subclass distribution plot provided
+in the CSR tab.
+
+
The data for the
+IGG
+subclass distribution plot: Data used for the generation of the IGG
+subclass distribution plot provided in the CSR tab.
+
+
Clonal relation
+
+
Sequence overlap
+between subclasses: Link to the overlap table as provided
+under the clonality overlap tab.
+
+
The Change-O DB
+file with defined clones and subclass annotation:
+Downloads a table with the calculation of clonal relation between all
+sequences. For each individual transcript the results of the clonal assignment
+as provided by Change-O are provided. Sequences with the same number in the CLONE
+column are considered clonally related.
+
+
The Change-O DB
+defined clones summary file: Gives a summary of the total number of
+clones in all sequences and their clone size.
+
+
The Change-O DB
+file with defined clones of IGA: Downloads a table with the
+calculation of clonal relation between all IGA sequences. For each individual
+transcript the results of the clonal assignment as provided by Change-O are
+provided. Sequences with the same number in the CLONE column are considered
+clonally related.
+
+
The Change-O DB
+defined clones summary file of IGA: Gives a summary
+of the total number of clones in all IGA sequences and their clone size.
+
+
The Change-O DB
+file with defined clones of IGG: Downloads a table with the
+calculation of clonal relation between all IGG sequences. For each individual
+transcript the results of the clonal assignment as provided by Change-O are
+provided. Sequences with the same number in the CLONE column are considered
+clonally related.
+
+
The Change-O DB
+defined clones summary file of IGG: Gives a summary
+of the total number of clones in all IGG sequences and their clone size.
+
+
The Change-O DB
+file with defined clones of IGM: Downloads a table
+with the calculation of clonal relation between all IGM sequences. For each
+individual transcript the results of the clonal assignment as provided by
+Change-O are provided. Sequences with the same number in the CLONE column are
+considered clonally related.
+
+
The Change-O DB
+defined clones summary file of IGM: Gives a summary
+of the total number of clones in all IGM sequences and their clone size.
+
+
The Change-O DB
+file with defined clones of IGE: Downloads a table with the
+calculation of clonal relation between all IGE sequences. For each individual
+transcript the results of the clonal assignment as provided by Change-O are
+provided. Sequences with the same number in the CLONE column are considered
+clonally related.
+
+
The Change-O DB
+defined clones summary file of IGE: Gives a summary
+of the total number of clones in all IGE sequences and their clone size.
+
+
Filtered IMGT
+output files
+
+
An IMGT archive
+with just the matched and filtered sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGA sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all IGA
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGA1 sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all IGA1
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGA2 sequences: Downloads a .txz
+file with the same format as downloaded IMGT files that contains all IGA2
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGG sequences: Downloads a .txz
+file with the same format as downloaded IMGT files that contains all IGG
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGG1 sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all IGG1
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGG2 sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all IGG2
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGG3 sequences: Downloads a .txz
+file with the same format as downloaded IMGT files that contains all IGG3
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGG4 sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all IGG4
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGM sequences: Downloads a .txz
+file with the same format as downloaded IMGT files that contains all IGM
+sequences that have passed the chosen filter settings.
+
+
An IMGT archive
+with just the matched and filtered IGE sequences: Downloads a
+.txz file with the same format as downloaded IMGT files that contains all IGE
+sequences that have passed the chosen filter settings.
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_first.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_first.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,127 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Table showing the order of each
+filtering step and the number and percentage of sequences after each filtering
+step.
+
+
Input: The
+number of sequences in the original IMGT file. This is always 100% of the
+sequences.
+
+
After "no results" filter: IMGT
+classifies sequences either as "productive", "unproductive", "unknown", or "no
+results". Here, the number and percentages of sequences that are not classified
+as "no results" are reported.
+
+
After functionality filter: The
+number and percentages of sequences that have passed the functionality filter. The
+filtering performed is dependent on the settings of the functionality filter.
+Details on the functionality filter can be found on the start page of
+the SHM&CSR pipeline.
+
+
After
+removal of sequences that are missing a gene region:
+In this step all sequences that are missing a gene region (FR1, CDR1, FR2,
+CDR2, FR3) that should be present are removed from analysis. The sequence
+regions that should be present are dependent on the settings of the sequence
+starts at filter. The number and
+percentage of sequences that pass this filter step are reported.
+
+
After
+N filter: In this step all sequences that contain
+an ambiguous base (n) in the analysed region or the CDR3 are removed from the
+analysis. The analysed region is determined by the setting of the sequence
+starts at filter. The number and percentage of sequences that pass this filter
+step are reported.
+
+
After
+filter unique sequences: The number and
+percentage of sequences that pass the "filter unique sequences" filter. Details
+on this filter can be found on the start page of
+the SHM&CSR pipeline.
+
+
After
+remove duplicate based on filter: The number and
+percentage of sequences that passed the remove duplicate filter. Details on the
+"remove duplicate filter based on filter" can be found on the start page of the
+SHM&CSR pipeline.
+
+
Number of matched sequences:
+The number and percentage of sequences that passed all the filters described
+above and have a (sub)class assigned.
+
+
Number
+of unmatched sequences: The number and percentage
+of sequences that passed all the filters described above and do not have
+a subclass assigned.
+
+
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_frequency.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_frequency.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,87 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
SHM
+frequency tab
+
+
Graphs
+
+
These
+graphs give insight into the level of SHM. The data represented in these graphs
+can be downloaded in the download tab. More
+information on the values found in healthy individuals of different ages can be
+found in IJspeert and van Schouwenburg et al, PMID: 27799928.
+
+
Frequency
+scatter plot
+
+
A
+dot plot showing the percentage of SHM in each transcript divided into the
+different (sub)classes. In the graph each dot
+represents an individual transcript.
+
+
Mutation
+frequency by class
+
+
A
+bar graph showing the percentage of transcripts that contain 0%, 0-2%, 2-5%,
+5-10%, 10-15%, 15-20% or more than 20% SHM for each subclass.
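
The binning behind this bar graph can be sketched as follows. This is a minimal illustration, not the pipeline's own code: the function name `shm_bin` is hypothetical, and the assumption that each range includes its upper bound (so exactly 2% falls in "0-2%") is not stated in the source.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a per-transcript SHM percentage to one of the bins
# named in the text. Assumption (not from the source): each range includes
# its upper bound, e.g. exactly 2% is reported as "0-2%".
shm_bin() {
  awk -v p="$1" 'BEGIN {
    if (p == 0)       print "0%";
    else if (p <= 2)  print "0-2%";
    else if (p <= 5)  print "2-5%";
    else if (p <= 10) print "5-10%";
    else if (p <= 15) print "10-15%";
    else if (p <= 20) print "15-20%";
    else              print ">20%";
  }'
}
```

Counting how many transcripts of a (sub)class fall into each bin, and dividing by the number of transcripts in that (sub)class, would then give the bar heights.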
+
+
Hanna IJspeert, Pauline A. van
+Schouwenburg, David van Zessen, Ingrid Pico-Knijnenburg, Gertjan J. Driessen,
+Andrew P. Stubbs, and Mirjam van der Burg (2016). Evaluation
+of the Antigen-Experienced B-Cell Receptor Repertoire in Healthy Children and
+Adults. In Frontiers in Immunology, 7, pp. e410-410. [doi:10.3389/fimmu.2016.00410][Link]
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_overview.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_overview.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,332 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Info
+table
+
+
This
+table contains information on different characteristics of SHM. For all
+characteristics information can be found for all sequences or only sequences of
+a certain (sub)class. All results are based on the sequences that passed the filter
+settings chosen on the start page of the SHM & CSR pipeline and only
+include details on the analysed region as determined by the setting of the
+sequence starts at filter. All data in this table can be downloaded via the
+downloads tab.
+
+
Mutation
+frequency:
+
+
These values
+give information on the level of SHM. More information
+on the values found in healthy individuals of different ages can be found in IJspeert
+and van Schouwenburg et al, PMID: 27799928.
+
+
Number
+of mutations: Shows the number of total
+mutations / the number of sequenced bases (the % of mutated bases).
+
+
Median
+number of mutations: Shows the median % of
+SHM of all sequences.
+
+
Patterns
+of SHM:
+
+
These values
+give insights into the targeting and patterns of SHM. These values can give
+insight into the repair pathways used to repair the U:G mismatches introduced
+by AID. More information
+on the values found in healthy individuals of different ages can be found in
+IJspeert and van Schouwenburg et al, PMID: 27799928.
+
+
Transitions:
+Shows the number of transition mutations / the number of total mutations (the
+percentage of mutations that are transitions). Transition mutations are C>T,
+T>C, A>G, G>A.
+
+
Transversions:
+Shows the number of transversion mutations / the number of total mutations (the
+percentage of mutations that are transversions). Transversion mutations are
+C>A, C>G, T>A, T>G, A>T, A>C, G>T, G>C.
+
+
Transitions
+at GC: Shows the number of transitions at GC locations (C>T,
+G>A) / the total number of mutations at GC locations (the percentage of
+mutations at GC locations that are transitions).
+
+
Targeting
+of GC: Shows the number of mutations at GC
+locations / the total number of mutations (the percentage of total mutations
+that are at GC locations).
+
+
Transitions
+at AT: Shows the number of transitions at AT
+locations (T>C, A>G) / the total number of mutations at AT locations (the
+percentage of mutations at AT locations that are transitions).
+
+
Targeting
+of AT: Shows the number of mutations at AT
+locations / the total number of mutations (the percentage of total mutations
+that are at AT locations).
+
+
RGYW:
+Shows
+the number of mutations that are in an RGYW motif / The number of total mutations
+(the percentage of mutations that are in an RGYW motif). RGYW motifs are known to be
+preferentially targeted by AID (R = purine,
+Y = pyrimidine, W = A or T).
+
+
WRCY:
+Shows the number of mutations
+that are in a WRCY motif / The number of
+total mutations (the percentage of mutations that are in a WRCY motif). WRCY
+motifs are known to be preferentially targeted by AID (R = purine,
+Y = pyrimidine, W = A or T).
+
+
WA:
+Shows
+the number of mutations that are in a WA motif / The number of total mutations
+(the percentage of mutations that are in a WA motif). Polymerase eta has been
+described to preferentially make errors at WA motifs (W
+= A or T).
+
+
TW:
+Shows the number of mutations that are in a TW motif / The number of total mutations
+(the percentage of mutations that are in a TW motif). Polymerase eta has been
+described to preferentially make errors at TW motifs (W
+= A or T).
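
The transition and transversion definitions listed above can be illustrated with a small sketch. The function names `classify_substitution` and `transition_percentage` are hypothetical and are not taken from the pipeline; substitutions are assumed to be written as "original>new", as in the text.

```shell
#!/usr/bin/env bash
# Classify one substitution written as "X>Y" (original base > new base),
# using the transition/transversion lists given in the help text.
classify_substitution() {
  case "$1" in
    "C>T"|"T>C"|"A>G"|"G>A") echo "transition" ;;
    "C>A"|"C>G"|"T>A"|"T>G"|"A>T"|"A>C"|"G>T"|"G>C") echo "transversion" ;;
    *) echo "unknown" ;;
  esac
}

# Percentage of transitions among the substitutions passed as arguments.
# Every argument counts towards the total, including unrecognised ones.
transition_percentage() {
  local total=0 transitions=0 s
  for s in "$@"; do
    total=$((total + 1))
    if [ "$(classify_substitution "$s")" = "transition" ]; then
      transitions=$((transitions + 1))
    fi
  done
  awk -v t="$transitions" -v n="$total" \
    'BEGIN { if (n > 0) printf "%.1f\n", 100 * t / n; else print "NA" }'
}
```

For example, `transition_percentage "C>T" "G>A" "A>T" "C>G"` counts two transitions out of four mutations. Restricting the input to mutations at GC or at AT locations would give the "Transitions at GC" and "Transitions at AT" values in the same way.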
+
+
Antigen
+selection:
+
+
These
+values give insight into antigen selection. It has been described that during
+antigen selection, there is selection against replacement mutations in the FR
+regions as these can cause instability of the B-cell receptor. In contrast,
+replacement mutations in the CDR regions are important for changing the
+affinity of the B-cell receptor and therefore there is selection for this type
+of mutation. Silent mutations do not alter the amino acid sequence and
+therefore do not play a role in selection. More information on the values found
+in healthy individuals of different ages can be found in IJspeert and van
+Schouwenburg et al, PMID: 27799928.
+
+
FR
+R/S: Shows the number of replacement
+mutations in the FR regions / The number of silent mutations in the FR regions
+(the number of replacement mutations in the FR regions divided by the number of
+silent mutations in the FR regions).
+
+
CDR
+R/S: Shows the number of replacement
+mutations in the CDR regions / The number of silent mutations in the CDR
+regions (the number of replacement mutations in the CDR regions divided by the
+number of silent mutations in the CDR regions).
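
As a minimal sketch of the R/S values described above: the helper `rs_ratio` is hypothetical, and printing "NA" when there are no silent mutations is an assumption, since the source does not say how that case is reported.

```shell
#!/usr/bin/env bash
# Hypothetical helper: replacement/silent (R/S) ratio for one region type.
# $1 = number of replacement mutations, $2 = number of silent mutations.
# Assumption: "NA" is printed when no silent mutations are present.
rs_ratio() {
  awk -v r="$1" -v s="$2" \
    'BEGIN { if (s == 0) print "NA"; else printf "%.2f\n", r / s }'
}
```

Calling `rs_ratio 30 20` divides 30 replacement mutations by 20 silent mutations; the same function applies to the FR counts and to the CDR counts.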
+
+
Number
+of sequenced nucleotides:
+
+
These
+values give information on the number of sequenced nucleotides.
+
+
Nt
+in FR: Shows the number of sequenced bases
+that are located in the FR regions / The total number of sequenced bases (the
+percentage of sequenced bases that are present in the FR regions).
+
+
Nt
+in CDR: Shows the number of sequenced bases
+that are located in the CDR regions / The total number of sequenced bases (the percentage of
+sequenced bases that are present in the CDR regions).
+
+
A:
+Shows the total number of sequenced
+adenines / The total number of sequenced bases (the percentage of sequenced
+bases that were adenines).
+
+
C:
+Shows
+the total number of sequenced cytosines / The total number of sequenced bases
+(the percentage of sequenced bases that were cytosines).
+
+
T:
+Shows
+the total number of sequenced thymines
+/ The total number of sequenced bases (the percentage of sequenced bases that
+were thymines).
+
+
G:
+Shows the total number of sequenced guanines / The total number of
+sequenced bases (the percentage of sequenced bases that were guanines).
+
+
Graphs
+
+
These graphs visualize
+information on the patterns and targeting of SHM and thereby give insight
+into the repair pathways used to repair the U:G mismatches introduced by AID. The
+data represented in these graphs can be downloaded in the download tab. More
+information on the values found in healthy individuals of different ages can be
+found in IJspeert and van Schouwenburg et al, PMID: 27799928.
+
+
+
Percentage
+of mutations in AID and pol eta motives
+
+
Visualizes
+for each
+(sub)class the percentage of mutations that are present in AID motifs (RGYW or
+WRCY) or polymerase eta motifs (WA or TW) (R = purine,
+Y = pyrimidine, W = A or T).
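
The four motifs can be expressed as regular expressions by expanding the IUPAC codes given above (R = A or G, Y = C or T, W = A or T). The sketch below only checks whether a short sequence is an instance of a motif; the function name `matches_motif` is hypothetical, and which position inside the motif must carry the mutation is not specified in this text and is not modelled here.

```shell
#!/usr/bin/env bash
# Return success (exit status 0) when the sequence is an instance of the
# named motif; returns 2 for an unknown motif name.
matches_motif() {
  local motif="$1" seq="$2" pattern
  case "$motif" in
    RGYW) pattern='^[AG]G[CT][AT]$' ;;
    WRCY) pattern='^[AT][AG]C[CT]$' ;;
    WA)   pattern='^[AT]A$' ;;
    TW)   pattern='^T[AT]$' ;;
    *)    return 2 ;;
  esac
  [[ "$seq" =~ $pattern ]]
}
```

For instance, `matches_motif RGYW AGCT` succeeds, and so does `matches_motif WRCY AGCT`, because WRCY is the reverse complement of RGYW and AGCT is its own reverse complement.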
+
+
Relative
+mutation patterns
+
+
Visualizes
+for each (sub)class the distribution of mutations between mutations at AT
+locations and transitions or transversions at GC locations.
+
+
Absolute
+mutation patterns
+
+
Visualizes
+for each (sub)class the percentage of sequenced AT and GC bases that are
+mutated. The mutations at GC bases are divided into transition and transversion
+mutations.
+
+
Hanna IJspeert, Pauline A. van
+Schouwenburg, David van Zessen, Ingrid Pico-Knijnenburg, Gertjan J. Driessen,
+Andrew P. Stubbs, and Mirjam van der Burg (2016). Evaluation
+of the Antigen-Experienced B-Cell Receptor Repertoire in Healthy Children and
+Adults. In Frontiers in Immunology, 7, pp. e410-410. [doi:10.3389/fimmu.2016.00410][Link]
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_selection.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_selection.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,128 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
References
+
+
Yaari, G. and Uduman, M. and Kleinstein, S. H. (2012). Quantifying
+selection in high-throughput Immunoglobulin sequencing data sets. In Nucleic Acids Research, 40 (17),
+pp. e134-e134. [doi:10.1093/nar/gks457][Link]
+
+
Graphs
+
+
AA
+mutation frequency
+
+
For
+each class, the frequency of replacement mutations at each amino acid position
+is shown, which is calculated by dividing the number of replacement mutations
+at a particular amino acid position by the number of sequences that have an amino
+acid at that particular position. Since the length of the CDR1 and CDR2 regions
+is not the same for every VH gene, some amino acid positions are absent.
+Therefore we calculate the frequency using the number of amino acids present at
+that particular location.
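
The per-position calculation described above can be sketched as follows. The function name `aa_mutation_frequency` and the comma-separated input format are assumptions made for illustration, as is printing "NA" for a position at which no sequence has an amino acid.

```shell
#!/usr/bin/env bash
# Hypothetical helper: per-position replacement-mutation frequency.
# $1 = comma-separated replacement-mutation counts, one per AA position
# $2 = comma-separated counts of sequences that have an amino acid there
# Positions without any amino acid present are reported as "NA" (assumption).
aa_mutation_frequency() {
  awk -v m="$1" -v p="$2" 'BEGIN {
    n = split(m, mut, ",");
    split(p, pres, ",");
    for (i = 1; i <= n; i++) {
      if (pres[i] > 0) printf "%.3f", mut[i] / pres[i];
      else             printf "NA";
      printf "%s", (i < n ? "," : "\n");
    }
  }'
}
```

With counts `"2,0,5"` and presence `"10,0,20"`, the first position is divided by 10 sequences and the third by 20, while the second position, absent in every sequence, is skipped rather than divided by zero.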
+
+
Antigen
+selection (BASELINe)
+
+
Shows
+the results of the analysis of antigen selection as performed using BASELINe.
+Details on the analysis performed by BASELINe can be found in Yaari et al,
+PMID: 22641856. The settings used for the analysis are:
+focused, SHM targeting model: human Tri-nucleotide, custom boundaries. The
+custom boundaries are dependent on the sequence starts at filter.
+
+
Leader:
+1:26:38:55:65:104:-
+
+
FR1: 27:27:38:55:65:104:-
+
+
CDR1: 27:27:38:55:65:104:-
+
+
FR2: 27:27:38:55:65:104:-
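
The dependence of the custom boundaries on the sequence starts at filter can be sketched as a simple lookup. The boundary strings are copied exactly as listed above; the function name `baseline_boundaries_for` and the setting names passed to it are hypothetical, and this is not necessarily how wrapper.sh sets `baseline_boundaries`.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: map the "sequence starts at" setting to the BASELINe
# boundary string listed in the help text. Returns 1 for an unknown setting.
baseline_boundaries_for() {
  case "$1" in
    leader)       echo "1:26:38:55:65:104:-" ;;
    FR1|CDR1|FR2) echo "27:27:38:55:65:104:-" ;;
    *)            return 1 ;;
  esac
}
```

A caller could then use something like `baseline_boundaries="$(baseline_boundaries_for leader)"` before invoking the BASELINe wrapper.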
+
+
Hanna IJspeert, Pauline A. van
+Schouwenburg, David van Zessen, Ingrid Pico-Knijnenburg, Gertjan J. Driessen,
+Andrew P. Stubbs, and Mirjam van der Burg (2016). Evaluation
+of the Antigen-Experienced B-Cell Receptor Repertoire in Healthy Children and
+Adults. In Frontiers in Immunology, 7, pp. e410-410. [doi:10.3389/fimmu.2016.00410][Link]
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a shm_transition.htm
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/shm_transition.htm Thu Dec 22 09:39:27 2016 -0500
@@ -0,0 +1,120 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
These graphs and
+tables give insight into the targeting and patterns of SHM. This can give
+insight into the DNA repair pathways used to repair the U:G mismatches
+introduced by AID. More information on the values found in healthy individuals
+of different ages can be found in IJspeert and van Schouwenburg et al, PMID:
+27799928.
+
+
Graphs
+
+
+
Heatmap transition
+information
+
+
Heatmaps visualizing for each subclass the frequency
+of all possible substitutions. On the x-axis the original base is shown, while
+the y-axis shows the new base. The darker the shade of blue, the more frequently
+this type of substitution occurs.
+
+
Bargraph
+transition information
+
+
Bar graph
+visualizing for each original base the distribution of substitutions into the other
+bases. A graph is included for each (sub)class.
+
+
Tables
+
+
Transition
+tables are shown for each (sub)class. All the original bases are listed
+horizontally, while the new bases are listed vertically.
+
+
Hanna IJspeert, Pauline A. van
+Schouwenburg, David van Zessen, Ingrid Pico-Knijnenburg, Gertjan J. Driessen,
+Andrew P. Stubbs, and Mirjam van der Burg (2016). Evaluation
+of the Antigen-Experienced B-Cell Receptor Repertoire in Healthy Children and
+Adults. In Frontiers in Immunology, 7, pp. e410-410. [doi:10.3389/fimmu.2016.00410][Link]
+
+
+
+
+
+
diff -r 05c62efdc393 -r a24f8c93583a wrapper.sh
--- a/wrapper.sh Tue Dec 20 09:03:15 2016 -0500
+++ b/wrapper.sh Thu Dec 22 09:39:27 2016 -0500
@@ -247,7 +247,7 @@
echo "---------------- pattern_plots.r ----------------"
echo "---------------- pattern_plots.r ----------------
" >> $log
- Rscript $dir/pattern_plots.r $outdir/data_${func}.txt $outdir/plot1 $outdir/plot2 $outdir/plot3 $outdir/shm_overview.txt 2>&1
+ Rscript $dir/pattern_plots.r $outdir/data_${func}.txt $outdir/aid_motives $outdir/relative_mutations $outdir/abolute_mutations $outdir/shm_overview.txt 2>&1
echo "" >> $output
+echo "
" >> $output
+cat $dir/shm_transition.htm >> $output
+
echo "" >> $output #transition tables tab end
echo "" >> $output
@@ -428,7 +435,7 @@
mkdir $outdir/baseline/IGA_IGG_IGM
if [[ $(wc -l < $outdir/new_IMGT/1_Summary.txt) -gt "1" ]]; then
cd $outdir/baseline/IGA_IGG_IGM
- bash $dir/baseline/wrapper.sh 1 1 1 1 0 0 "${baseline_boundaries}" $outdir/new_IMGT.txz "IGA_IGG_IGM" "$dir/baseline/IMGTVHreferencedataset20161215.fa" "$outdir/baseline.pdf" "Sequence.ID" "$outdir/baseline.txt"
+ bash $dir/baseline/wrapper.sh 1 1 1 1 0 0 "${baseline_boundaries}" $outdir/new_IMGT.txz "IGA_IGG_IGM_IGE" "$dir/baseline/IMGTVHreferencedataset20161215.fa" "$outdir/baseline.pdf" "Sequence.ID" "$outdir/baseline.txt"
else
echo "No sequences" > "$outdir/baseline.txt"
fi
@@ -496,6 +503,9 @@
fi
fi
+echo "
" >> $output
+cat $dir/shm_selection.htm >> $output
+
echo "
" >> $output #antigen selection tab end
echo "" >> $output #CSR tab
@@ -509,6 +519,9 @@
echo "
" >> $output
fi
+echo "
" >> $output
+cat $dir/shm_csr.htm >> $output
+
echo "
" >> $output #CSR tab end
if [[ "$fast" == "no" ]] ; then
@@ -562,7 +575,7 @@
PWD="$tmp"
- echo "" >> $output #clonality tab
+ echo "
" >> $output #clonality tab
function clonality_table {
local infile=$1
@@ -606,13 +619,15 @@
clonality_table $outdir/change_o/change-o-defined_clones-summary-IGM.txt $output
echo "
" >> $output
- echo "
" >> $output
- cat "$outdir/sequence_overview/index.html" | sed "s%href='\(.*\).html%href='sequence_overview/\1.html%g" >> $output # rewrite href to 'sequence_overview/..."
+ echo "
" >> $output
+ cat "$outdir/sequence_overview/index.html" | sed -e 's::\n:g' | sed "s:href='\(.*\).html:href='sequence_overview/\1.html:g" >> $output # rewrite href to 'sequence_overview/..."
echo "
" >> $output
-
-
+
echo "
" >> $output #clonality tabber end
-
+
+ echo "
" >> $output
+ cat $dir/shm_clonality.htm >> $output
+
echo "
" >> $output #clonality tab end
fi
@@ -630,9 +645,9 @@
echo "Motif data per sequence ID | Download |
" >> $output
echo "Mutation data per sequence ID | Download |
" >> $output
echo "Base count for every sequence | View |
" >> $output
-echo "The data used to generate the RGYW/WRCY and TW/WA plot | Download |
" >> $output
-echo "The data used to generate the relative transition and transversion plot | Download |
" >> $output
-echo "The data used to generate the absolute transition and transversion plot | Download |
" >> $output
+echo "The data used to generate the percentage of mutations in AID and pol eta motives plot | Download |
" >> $output
+echo "The data used to generate the relative mutation patterns plot | Download |
" >> $output
+echo "The data used to generate the absolute mutation patterns plot | Download |
" >> $output
echo "SHM Frequency |
" >> $output
echo "The data generate the frequency scatter plot | Download |
" >> $output
@@ -654,7 +669,7 @@
echo "Antigen Selection |
" >> $output
echo "AA mutation data per sequence ID | Download |
" >> $output
-echo "Absent AA location data per sequence ID | Download |
" >> $output
+echo "Presence of AA per sequence ID | Download |
" >> $output
echo "The data used to generate the aa mutation frequency plot | Download |
" >> $output
echo "The data used to generate the aa mutation frequency plot for IGA | Download |
" >> $output
@@ -674,10 +689,10 @@
echo "Baseline IGE data | Download |
" >> $output
echo "CSR |
" >> $output
-echo "The data for the CSR IGA pie plot | Download |
" >> $output
-echo "The data for the CSR IGG pie plot | Download |
" >> $output
+echo "The data for the IGA subclass distribution plot | Download |
" >> $output
+echo "The data for the IGG subclass distribution plot | Download |
" >> $output
-echo "Clonality |
" >> $output
+echo "Clonal Relation |
" >> $output
echo "Sequence overlap between subclasses | View |
" >> $output
echo "The Change-O DB file with defined clones and subclass annotation | Download |
" >> $output
echo "The Change-O DB defined clones summary file | Download |
" >> $output
@@ -705,6 +720,9 @@
echo "
" >> $output
+echo "