abstar: scalable AIRR annotation

abstar is a tool for V(D)J germline gene assignment and antibody/TCR sequence annotation. It assigns germline genes using MMseqs2 and produces detailed sequence annotations, including mutations, indels, regions (CDR/FWR), and productivity assessment. It scales from single sequences to billions.

Key Features

Fast: MMseqs2-powered germline assignment scales to billions of sequences
AIRR-compliant: Full compatibility with AIRR data standards
BCR and TCR: Support for both B-cell and T-cell receptor sequences
Flexible output: AIRR TSV or Parquet formats
UMI support: Built-in UMI extraction and parsing
Read merging: Automatic paired-end read merging with fastp
Custom germline databases: Build databases from OGRDB, IgDiscover, Digger, or FASTA

Quick Example

Command Line:

# Install
pip install abstar

# Annotate human BCR sequences
abstar run sequences.fasta output_dir/

# TCR sequences
abstar run tcr.fasta output_dir/ --receptor tcr

Python:

import abstar

# Return annotated Sequence objects
sequences = abstar.run("sequences.fasta")

# Return as polars DataFrame
df = abstar.run("sequences.fasta", as_dataframe=True)
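To annotate several files from a script, one option is to build the same `abstar run` command lines shown above and hand them to `subprocess`. The following is a minimal sketch, not part of abstar itself: the helper names `build_abstar_command` and `annotate_all` are hypothetical, as is the choice of one output subdirectory per input file. It relies only on the `abstar run <input> <output_dir>` form and the `--receptor` option shown in the command-line example.

```python
import subprocess
from pathlib import Path


def build_abstar_command(input_path, output_dir, receptor=None):
    """Build the argument list for one `abstar run` call (hypothetical helper)."""
    cmd = ["abstar", "run", str(input_path), str(output_dir)]
    if receptor is not None:
        # Mirrors the `--receptor tcr` usage from the command-line example.
        cmd += ["--receptor", receptor]
    return cmd


def annotate_all(input_paths, output_root, receptor=None, dry_run=True):
    """Build (and optionally run) one command per input file.

    Each input gets its own subdirectory under `output_root`, named after
    the file stem. This layout is an assumption made for illustration.
    """
    commands = []
    for path in input_paths:
        out_dir = Path(output_root) / Path(path).stem
        cmd = build_abstar_command(path, out_dir, receptor)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if abstar exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


# Dry run: only builds the commands, so abstar does not need to be installed.
planned = annotate_all(["a.fasta", "b.fasta"], "results", receptor="tcr")
```

Passing `dry_run=False` would execute each command in turn; keeping the default makes it easy to inspect the planned calls first.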

Input and Output

Input: FASTA or FASTQ files (gzip-compressed files are supported)

Output: AIRR-compliant TSV or Parquet files containing:

V(D)J gene assignments
CDR/FWR region sequences
Mutation and indel annotations
Productivity assessment
Junction/CDR3 analysis
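Because the TSV output follows the AIRR format, it can be read with general-purpose tools. The sketch below uses only the Python standard library to count V gene usage among productive sequences. The column names (`sequence_id`, `v_call`, `j_call`, `productive`, `junction_aa`) come from the AIRR Rearrangement schema; the rows are made up for illustration, and the exact spelling of boolean values in abstar's output is an assumption, so several common forms are accepted.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only; real output contains many more columns.
EXAMPLE_TSV = (
    "sequence_id\tv_call\tj_call\tproductive\tjunction_aa\n"
    "seq1\tIGHV1-69*01\tIGHJ4*02\tT\tCARDYW\n"
    "seq2\tIGHV3-23*01\tIGHJ6*02\tT\tCAKGYW\n"
    "seq3\tIGHV1-69*02\tIGHJ4*02\tF\tCAR*YW\n"
    "seq4\tIGHV1-69*06\tIGHJ5*02\tT\tCAREYW\n"
)

# Assumption: booleans may be written as T/F, true/false, or 1/0.
TRUE_VALUES = {"t", "true", "1"}


def v_gene_usage(handle):
    """Count V genes (allele suffix removed) across productive rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if row["productive"].strip().lower() not in TRUE_VALUES:
            continue
        # "IGHV1-69*01" -> "IGHV1-69": collapse alleles to the gene level.
        gene = row["v_call"].split("*")[0]
        counts[gene] += 1
    return counts


usage = v_gene_usage(io.StringIO(EXAMPLE_TSV))
```

For a real file, the `io.StringIO` object would be replaced with an open file handle, for example `open(path, newline="")`.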

Germline Databases

Built-in databases: human, mouse, macaque, humouse

Human and mouse databases are based on the OGRDB germline reference sets. Custom databases can be built from FASTA/JSON files.
