Analysis of antibody repertoires by high throughput sequencing is of major importance in understanding adaptive immune responses. Our knowledge of variations in the genomic loci encoding antibody genes is incomplete, mostly due to technical difficulties in aligning short reads to these highly repetitive loci. The partial knowledge results in conflicting V-D-J gene assignments between different algorithms, and biased genotype and haplotype inference. Previous studies have shown that haplotypes can be inferred by taking advantage of IGHJ6 heterozygosity, observed in approximately one third of the population [1], [2].
Here we provide a robust novel method for determining V-D-J haplotypes by adapting a Bayesian framework, RAbHIT. Our method extends haplotype inference to IGHD and IGHV based analysis, thereby enabling inference of complex genetic events like deletions and copy number variations in the entire population. It calculates a Bayes factor, a number that indicates the certainty level of the inference, for each haplotyped gene.
More details can be found here:
The simplest way to install RAbHIT is through CRAN
To install RAbHIT from the source code please use the code below:
For further installations instruction please refer to https://bitbucket.org/yaarilab/rabhit
RAbHIT requires two main inputs:
Antibody repertoire sequencing data is in a data frame format. Each row represents a unique observation and columns represent data about that observation. The names of the required columns are provided below along with a short description.
| Column Name | Description |
|---|---|
SUBJECT |
Subject name |
V_CALL |
(Comma separated) name(s) of the nearest V allele(s) (IMGT format) |
D_CALL |
(Comma separated) name(s) of the nearest D allele(s) |
J_CALL |
(Comma separated) name(s) of the nearest J allele(s) |
An example dataset is provided with RAbHIT. It contains unique naive b-cell sequences, from multiple individuals. One individual in the example data set appears twice, once with full V coverage and once partial V coverage (I5 and I5_FR2 respectively).
The database of germline sequences should be provided in a FASTA format with sequences gaped according to the IMGT numbering scheme[3]. IGHV, IGHD, and IGHJ alleles in the IMGT database (release 2018-12-4) are provided with this package (HVGERM, HDGERM, and HJGERM). We removed the following IGHV and IGHD alleles, as they are duplicated from other allele assignments (All duplicates are mentioned in IMGT).
| V/D allele | V/D allele duplicated (removed) |
|---|---|
| IGHV1-69*01 | IGHV1-69D*01 |
| IGHV2-70*04 | IGHV2-70D*04 |
| IGHV3-23*01 | IGHV3-23D*01 |
| IGHV3-30*04 | IGHV3-30-3*03 |
| IGHV3-29*01 | IGHV3-30-4\2*01 |
| IGHV3-30*18 | IGHV3-30-5*01 |
| IGHV3-30*02 | IGHV3-30-5*02 |
| IGHD4-11*01 | IGHD4-4*01 |
| IGHD5-18*01 | IGHD5-5*01 |
To obtain the most reliable result we suggest following the data pre-processing steps below.
To load RAbHIT we’ll run the following:
The functions provided by this package can be used to perform any combination of the following:
An individual’s haplotype can be inferred using the function createFullHaplotype. The function infers the haplotype based on the provided anchor gene. Using this function, a contingency table is created for each gene, from which a strand is inferred for each allele. The user can set the anchor gene for haplotyping as well as the column for which a haplotype should be inferred.
# Load example sequence data and example germline database
data(samples_db, HVGERM, HDGERM)
# Selecting a single individual
clip_db <- samples_db[samples_db$subject=='I5', ]
# Inferring haplotype using J6 as anchor
haplo_db_J6 <- createFullHaplotype(clip_db, toHap_col=c("v_call","d_call"),
hapBy_col="j_call", hapBy="IGHJ6",
toHap_GERM=c(HVGERM, HDGERM))We can plot the haplotype map using the plotHaplotype function. Each row represents a different gene and the genes are ordered by the chromosome location. Each color represents the different alleles, light gray represents an unknown, dark gray a deletion and off-white a non-reliable allele annotation. The blues represent the confidence level of the inference (lK). The most left panels show the count for each allele on each chromosome. The middle panels show the allele inference for each gene on each chromosome, The most right panels show the confidence level of the inference. non reliable alleles are annotated with [*] and a number and are found below the graph.
One of the advantages of RAbHIT, is the ability to infer haplotype by other anchor genes than J6. RAbHIT offers utilizing any gene as anchor, however, we recommend that the fraction of the least frequent allele of the anchor gene will be above 0.3. So far only J6, D2-8, and D2-21 has been proven to infer the heavy chain loci correctly. Here, we chose a single individual from our example dataset with heterozygousity in J6 and D2-21, inferred haplotypes according to both anchors, compared them using the hapDendo function, and calculated the Jaccard distance between the inferred haplotypes.
# Inferring haplotype using D2-21 as anchor
haplo_db_D2_21 <- createFullHaplotype(clip_db, toHap_col="v_call",
hapBy_col="d_call", hapBy="IGHD2-21", toHap_GERM=HVGERM)To combine the two data frames we need to associate each of the D2-21 alleles with J6 alleles. We can do that from the haplotype inference with the J6 as anchor
D2-21*02 goes with J6*02 and that D2-21*01 goes with J6*03. Now to bind the data frames together, will change the sample name to match the anchor gene used in each haplotype, and change the anchor gene columns to a generic name.
# rename the subject
haplo_db_J6$subject <- 'J6'
haplo_db_D2_21$subject <- 'D2-21'
# change the anchor gene columns
# For D2-21*01 and J6*03 we will change the column to Anchor_J03_D01
names(haplo_db_J6)[which(names(haplo_db_J6)=='IGHJ6_03')] <- "AnchorJ03D01"
names(haplo_db_D2_21)[which(names(haplo_db_D2_21)=='IGHD2-21_01')] <- "AnchorJ03D01"
# For D2-21*02 and J6*02 we will change the column to Anchor_J02_D02
names(haplo_db_J6)[which(names(haplo_db_J6)=='IGHJ6_02')] <- "AnchorJ02D02"
names(haplo_db_D2_21)[which(names(haplo_db_D2_21)=='IGHD2-21_02')] <- "AnchorJ02D02"
# Subseting the haplo_db_J6 dataset to include only the V genes
haplo_db_J6 <- haplo_db_J6[grep('IGHV',haplo_db_J6$gene),]
# Combining the datasets rowwise
haplo_comb <- rbind(haplo_db_J6,haplo_db_D2_21)Now, we can compare the haplotypes using the hapDendo function.