abstar: scalable AIRR annotation

abstar is a tool for V(D)J germline gene assignment and antibody/TCR sequence annotation. It assigns germline genes using MMseqs2 and annotates each sequence in detail, including mutations, indels, regions (CDR/FWR), and productivity. It scales from single sequences to billions.

Key Features

  • Fast: MMseqs2-powered germline assignment scales to billions of sequences

  • AIRR-compliant: Full compatibility with AIRR data standards

  • BCR and TCR: Support for both B-cell and T-cell receptor sequences

  • Flexible output: AIRR TSV or Parquet formats

  • UMI support: Built-in UMI extraction and parsing

  • Read merging: Automatic paired-end read merging with fastp

  • Custom germline databases: Build databases from OGRDB, IgDiscover, Digger, or FASTA

Quick Example

Command Line:

# Install
pip install abstar

# Annotate human BCR sequences
abstar run sequences.fasta output_dir/

# TCR sequences
abstar run tcr.fasta output_dir/ --receptor tcr

Python:

import abstar

# Return annotated Sequence objects
sequences = abstar.run("sequences.fasta")

# Return as polars DataFrame
df = abstar.run("sequences.fasta", as_dataframe=True)

Input and Output

Input: FASTA or FASTQ files (gzip-compressed files are supported)
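Before launching a large run, it can help to confirm that an input file really is FASTA or FASTQ, compressed or not. The sketch below is not part of abstar: `open_maybe_gzip` and `sniff_format` are hypothetical helper names, and the check relies only on the gzip magic bytes and the first record character (`>` for FASTA, `@` for FASTQ).

```python
import gzip
from pathlib import Path


def open_maybe_gzip(path):
    """Open a file in text mode, decompressing if it starts with the gzip magic bytes."""
    path = Path(path)
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def sniff_format(path):
    """Return 'fasta' or 'fastq' based on the first non-empty line, else None.

    Hypothetical helper for illustration; abstar performs its own input handling.
    """
    with open_maybe_gzip(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return None  # first record line is neither FASTA nor FASTQ
    return None  # empty file
```

Calling `sniff_format("sequences.fasta")` on each input and skipping files that return `None` is one cheap way to catch misnamed or empty files early.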

Output: AIRR-compliant TSV or Parquet files containing:

  • V(D)J gene assignments

  • CDR/FWR region sequences

  • Mutation and indel annotations

  • Productivity assessment

  • Junction/CDR3 analysis
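Because the TSV output follows the AIRR Rearrangement schema, it can be read with ordinary tabular tooling. The sketch below uses only the Python standard library to count V gene usage among productive sequences. It assumes the standard AIRR field names `v_call` and `productive` and the AIRR TSV convention of encoding booleans as `T`/`F` (it also accepts `true`, in case a writer uses that form). The function name and the example rows are made up for illustration and are not abstar output.

```python
import csv
import io
from collections import Counter

# AIRR TSV encodes booleans as "T"/"F"; accepting "TRUE" as well is a defensive assumption.
TRUE_VALUES = {"T", "TRUE"}


def productive_v_gene_counts(handle):
    """Count V genes (allele suffix removed) across productive rows of an AIRR TSV."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if (row.get("productive") or "").strip().upper() not in TRUE_VALUES:
            continue  # skip non-productive or unannotated rows
        v_call = (row.get("v_call") or "").strip()
        if not v_call:
            continue  # no V assignment
        # v_call may hold several comma-separated alleles; keep the first
        # and drop the "*01"-style allele suffix to get the gene name.
        gene = v_call.split(",")[0].split("*")[0]
        counts[gene] += 1
    return counts


# Illustrative rows only (hypothetical data).
EXAMPLE_TSV = (
    "sequence_id\tv_call\tproductive\tjunction_aa\n"
    "seq1\tIGHV1-69*01\tT\tCARDYW\n"
    "seq2\tIGHV1-69*02\tT\tCARGGW\n"
    "seq3\tIGHV3-23*01\tF\tCAK*W\n"
    "seq4\tIGHV3-23*01,IGHV3-23D*01\tT\tCAKDW\n"
)

counts = productive_v_gene_counts(io.StringIO(EXAMPLE_TSV))
for gene, n in counts.most_common():
    print(gene, n)
```

For a real file, pass an open handle instead of the in-memory example, for instance `open(path, newline="")`. For very large outputs, a columnar reader over the Parquet format is likely a better fit than row-by-row parsing.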

Germline Databases

Built-in databases: human, mouse, macaque, humouse

Human and mouse databases are based on the OGRDB germline reference sets. Custom databases can be built from FASTA/JSON files.
