RAbHIT: R Antibody Haplotype Inference Tool
logo

Introduction

Analysis of antibody repertoires by high throughput sequencing is of major importance in understanding adaptive immune responses. Our knowledge of variations in the genomic loci encoding antibody genes is incomplete, mostly due to technical difficulties in aligning short reads to these highly repetitive loci. The partial knowledge results in conflicting V-D-J gene assignments between different algorithms, and biased genotype and haplotype inference. Previous studies have shown that haplotypes can be inferred by taking advantage of IGHJ6 heterozygosity, observed in approximately one third of the population [1], [2].

Here we provide a robust novel method for determining V-D-J haplotypes by adapting a Bayesian framework, RAbHIT. Our method extends haplotype inference to IGHD and IGHV based analysis, thereby enabling inference of complex genetic events like deletions and copy number variations in the entire population. It calculates a Bayes factor, a number that indicates the certainty level of the inference, for each haplotyped gene.

More details can be found here:

Gidoni, Moriah, et al. “Mosaic deletion patterns of the human antibody heavy chain gene locus shown by Bayesian haplotyping.” Nature Communications 10.1 (2019): 628.

Installation

Download

The latest stable release of RAbHIT can be downloaded from CRAN or Bitbucket.

Installing Released Versions

The simplest way to install RAbHIT is through CRAN

Building Development Versions

To install RAbHIT from the source code please use the code below:

For further installations instruction please refer to https://bitbucket.org/yaarilab/rabhit

Input

RAbHIT requires two main inputs:

  1. Pre-processed antibody repertoire sequencing data with heterozygosity in at least one gene
  2. Database of germline gene sequences

Antibody repertoire sequencing data is in a data frame format. Each row represents a unique observation and columns represent data about that observation. The names of the required columns are provided below along with a short description.

Column Name Description
SUBJECT Subject name
V_CALL (Comma separated) name(s) of the nearest V allele(s) (IMGT format)
D_CALL (Comma separated) name(s) of the nearest D allele(s)
J_CALL (Comma separated) name(s) of the nearest J allele(s)

An example dataset is provided with RAbHIT. It contains unique naive b-cell sequences, from multiple individuals. One individual in the example data set appears twice, once with full V coverage and once partial V coverage (I5 and I5_FR2 respectively).

The database of germline sequences should be provided in a FASTA format with sequences gaped according to the IMGT numbering scheme[3]. IGHV, IGHD, and IGHJ alleles in the IMGT database (release 2018-12-4) are provided with this package (HVGERM, HDGERM, and HJGERM). We removed the following IGHV and IGHD alleles, as they are duplicated from other allele assignments (All duplicates are mentioned in IMGT).

V/D allele V/D allele duplicated (removed)
IGHV1-69*01 IGHV1-69D*01
IGHV2-70*04 IGHV2-70D*04
IGHV3-23*01 IGHV3-23D*01
IGHV3-30*04 IGHV3-30-3*03
IGHV3-29*01 IGHV3-30-4\2*01
IGHV3-30*18 IGHV3-30-5*01
IGHV3-30*02 IGHV3-30-5*02
IGHD4-11*01 IGHD4-4*01
IGHD5-18*01 IGHD5-5*01

Pre-processing of the data

To obtain the most reliable result we suggest following the data pre-processing steps below.

  1. Align each sequence to infer the germline VDJ alleles.
  2. Remove duplicated sequences (pRESTO[4]).
  3. Infer novel alleles (TIgGER[5]).
  4. Re-align the sequences with the extended V reference including the new alleles.
  5. Infer the individual genotype according to the following:
    • Cell type:
      • If the dataset is from PBMCs then, to avoid memory cell influence on the results, it is important to infer clones prior to genotyping. After inferring clones, choose a representative sequence with the fewest mutations for each clone.
      • If the dataset is from naive B-cells then no additional reduction is needed.
    • V coverage:
      • If the dataset sequences are with partial V coverage, first determine reliable gene and alleles, then change the non-reliable alleles’ annotation and genotype.
      • If the dataset is from naive B-cells then no additional reduction is needed.
  6. Re-align the sequences with the individual personal genotype as a reference.
  7. It is recommended to filter the sequences to only V genes with ≤3 mutations and no mutations in D gene.
  8. For best haplotype inference results, ideally the dataset final unique VDJ count should be more than 2000 sequences.

Running RAbHIT

To load RAbHIT we’ll run the following:

The functions provided by this package can be used to perform any combination of the following:

  1. Infer haplotype by anchor gene
  2. Infer D/J single chromosome deletions by the Vpooled approach
  3. Infer double chromosome deletions by relative gene usage
  4. Graphical output of the inferred haplotype
  5. Graphical output of the inferred deletions

Infer haplotype by anchor gene

Haplotype with J6 as anchor

An individual’s haplotype can be inferred using the function createFullHaplotype. The function infers the haplotype based on the provided anchor gene. Using this function, a contingency table is created for each gene, from which a strand is inferred for each allele. The user can set the anchor gene for haplotyping as well as the column for which a haplotype should be inferred.

We can plot the haplotype map using the plotHaplotype function. Each row represents a different gene and the genes are ordered by the chromosome location. Each color represents the different alleles, light gray represents an unknown, dark gray a deletion and off-white a non-reliable allele annotation. The blues represent the confidence level of the inference (lK). The most left panels show the count for each allele on each chromosome. The middle panels show the allele inference for each gene on each chromosome, The most right panels show the confidence level of the inference. non reliable alleles are annotated with [*] and a number and are found below the graph.

Haplotype with D2-21 as anchor

One of the advantages of RAbHIT, is the ability to infer haplotype by other anchor genes than J6. RAbHIT offers utilizing any gene as anchor, however, we recommend that the fraction of the least frequent allele of the anchor gene will be above 0.3. So far only J6, D2-8, and D2-21 has been proven to infer the heavy chain loci correctly. Here, we chose a single individual from our example dataset with heterozygousity in J6 and D2-21, inferred haplotypes according to both anchors, compared them using the hapDendo function, and calculated the Jaccard distance between the inferred haplotypes.

To combine the two data frames we need to associate each of the D2-21 alleles with J6 alleles. We can do that from the haplotype inference with the J6 as anchor

D2-21*02 goes with J6*02 and that D2-21*01 goes with J6*03. Now to bind the data frames together, will change the sample name to match the anchor gene used in each haplotype, and change the anchor gene columns to a generic name.

Now, we can compare the haplotypes using the hapDendo function.