abstar: scalable AIRR annotation
abstar is a tool for VDJ germline gene assignment and antibody/TCR sequence annotation. It assigns germline genes using MMseqs2 and produces detailed sequence annotations, including mutations, indels, regions (CDR/FWR), and productivity assessment. It scales from single sequences to billions.
Key Features
Fast: MMseqs2-powered germline assignment scales to billions of sequences
AIRR-compliant: Full compatibility with AIRR data standards
BCR and TCR: Support for both B-cell and T-cell receptor sequences
Flexible output: AIRR TSV or Parquet formats
UMI support: Built-in UMI extraction and parsing
Read merging: Automatic paired-end read merging with fastp
Custom germline databases: Build databases from OGRDB, IgDiscover, Digger, or FASTA
Quick Example
Command Line:
# Install
pip install abstar
# Annotate human BCR sequences
abstar run sequences.fasta output_dir/
# TCR sequences
abstar run tcr.fasta output_dir/ --receptor tcr
Python:
import abstar
# Return annotated Sequence objects
sequences = abstar.run("sequences.fasta")
# Return as polars DataFrame
df = abstar.run("sequences.fasta", as_dataframe=True)
Input and Output
Input: FASTA or FASTQ files (gzip-compressed files are supported)
Output: AIRR-compliant TSV or Parquet files containing:
V(D)J gene assignments
CDR/FWR region sequences
Mutation and indel annotations
Productivity assessment
Junction/CDR3 analysis
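As a rough illustration of working with the AIRR TSV output downstream, the sketch below tallies V gene usage among productive sequences using only the Python standard library. It is not part of abstar. The column names (`sequence_id`, `v_call`, `j_call`, `productive`, `junction_aa`) come from the AIRR Rearrangement schema, the sample rows are made up for the example, and the helper names `is_true` and `v_gene_usage` are hypothetical.

```python
import csv
import io
from collections import Counter

# Illustrative rows only (NOT real abstar output). Column names follow the
# AIRR Rearrangement schema; the values are invented for this sketch.
SAMPLE_TSV = (
    "sequence_id\tv_call\tj_call\tproductive\tjunction_aa\n"
    "seq1\tIGHV3-23*01\tIGHJ4*02\tT\tCAKDYW\n"
    "seq2\tIGHV1-69*01\tIGHJ6*02\tF\t\n"
    "seq3\tIGHV3-23*01\tIGHJ4*02\tT\tCARGGYW\n"
)


def is_true(value):
    """AIRR TSV encodes booleans as T/F; also accept true/false spellings."""
    return value.strip().lower() in {"t", "true"}


def v_gene_usage(handle):
    """Count V genes (allele suffix removed) among productive rows."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        if not is_true(row["productive"]):
            continue
        # v_call may list several comma-separated calls; keep the first one,
        # then drop the allele suffix after '*'.
        first_call = row["v_call"].split(",")[0]
        counts[first_call.split("*")[0]] += 1
    return counts


usage = v_gene_usage(io.StringIO(SAMPLE_TSV))
print(usage.most_common())
```

To run this against a real file, open the TSV with `open(path, newline="")` and pass the handle to `v_gene_usage`. The exact location of the TSV inside the output directory is not described here, so the path is left to the reader.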
Germline Databases
Built-in databases: human, mouse, macaque, humouse
Human and mouse databases are based on the OGRDB germline reference sets. Custom databases can be built from FASTA/JSON files.