abstar: scalable AIRR annotation
Continuous improvements in the throughput of next-generation sequencing platforms have made adaptive immune receptor repertoire (AIRR) sequencing an increasingly important tool for detailed characterization of the immune response to infection and immunization. Accordingly, there is a need for open, scalable software for the genetic analysis of repertoire-scale antibody sequence data.
abstar is a core component of the ab[x] toolkit for antibody sequence analysis.
abstar performs V(D)J germline gene assignment and antibody sequence annotation, and can readily scale
from a single sequence to billions of sequences. abstar is fully compliant with AIRR
data standards and produces annotated sequence data in AIRR format.
usage
abstar can be used from the command line or through its Python API. The command-line interface
provides a straightforward way to process files containing antibody sequences, while the Python API
allows for integration of abstar’s annotation capabilities into custom analysis pipelines. For large
datasets, abstar offers distributed processing capabilities to accelerate annotation of many sequences
in parallel. Detailed usage instructions are available in the CLI and API documentation sections.
file formats
Input Formats: abstar accepts antibody sequences in FASTA and FASTQ formats, which are standard
formats for storing nucleotide sequences. These can be raw sequencing output files (for example, paired-end reads
from Illumina or Element sequencing platforms) or pre-processed sequence data.
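To make the input format concrete, here is a minimal sketch of how FASTA records can be read using only the Python standard library. It is an illustration of the file format, not abstar's own parser; the helper name `read_fasta` and the example sequences are hypothetical.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (sequence_id, sequence) pairs from a FASTA-formatted text stream.

    Hypothetical helper for illustration; not part of abstar.
    """
    seq_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # The ID is the first whitespace-delimited token after '>'.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            chunks = []
        else:
            # Sequence lines may be wrapped, so collect and join them.
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


# Made-up example input with one wrapped and one single-line record.
example = io.StringIO(">seq1 heavy chain\nCAGGTGCAG\nCTGGTGGAG\n>seq2\nGACATCCAG\n")
records = list(read_fasta(example))
```

Each record comes back as an identifier plus the full, unwrapped nucleotide sequence, which is the information an annotation tool needs from a FASTA input file.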
Output Formats: by default, abstar generates annotation results in AIRR-compliant TSV format. abstar
can also write output in Parquet format, a columnar storage format that is more space-efficient
for large datasets and can be faster for certain types of analysis. In either
case (TSV or Parquet), all output adheres to the standardized AIRR schema, ensuring interoperability
with other tools in the immunoinformatics ecosystem.
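Because the output follows the AIRR schema, it can be consumed with generic tooling. The sketch below tallies V gene usage among productive sequences in an AIRR TSV, using only the Python standard library. The column names (`sequence_id`, `v_call`, `j_call`, `productive`) and the `T`/`F` boolean encoding come from the AIRR Rearrangement schema; the function name `v_gene_usage` and the example rows are made up for illustration and are not abstar output.

```python
import csv
import io
from collections import Counter

# Tiny in-memory stand-in for an AIRR TSV file (values are made up).
airr_tsv = (
    "sequence_id\tv_call\tj_call\tproductive\n"
    "seq1\tIGHV1-69*01\tIGHJ4*02\tT\n"
    "seq2\tIGHV3-23*01\tIGHJ6*02\tT\n"
    "seq3\tIGHV1-69*01\tIGHJ4*02\tF\n"
)


def v_gene_usage(handle):
    """Count V gene usage among productive rows of an AIRR TSV stream.

    Hypothetical helper for illustration; not part of abstar.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        # AIRR TSV encodes booleans as "T" / "F".
        if row["productive"] != "T":
            continue
        # Keep the first call if several are listed, and drop the allele suffix.
        gene = row["v_call"].split(",")[0].split("*")[0]
        counts[gene] += 1
    return counts


usage = v_gene_usage(io.StringIO(airr_tsv))
```

The same logic applies to a real file by passing an open file handle instead of the `io.StringIO` object. For Parquet output, a columnar reader would be the natural substitute for `csv.DictReader`.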
germline databases
abstar ships with built-in germline databases for human, macaque, and mouse. The human and mouse databases
are based on the Open Germline Receptor Database (OGRDB) germline
reference sets. abstar also supports the use and creation of custom germline databases, which can
be used to annotate sequences from species that are not included in the built-in databases or to use
donor-specific databases created using tools like IgDiscover
or Digger.