abstar: scalable AIRR annotation =================================================================== Continuous improvements in the throughput of next-generation sequencing platforms have made adaptive immune receptor repertoire (AIRR) sequencing an increasingly important tool for detailed characterization of the immune response to infection and immunization. Accordingly, there is a need for open, scalable software for the genetic analysis of repertoire-scale antibody sequence data. ``abstar`` is a core component of the ab[x] toolkit for antibody sequence analysis. ``abstar`` performs V(D)J germline gene assignment and antibody sequence annotation, and can readily scale from a single sequence to billions of sequences. ``abstar`` is fully compliant with `AIRR `_ data standards and produces annotated sequence data in `AIRR `_ format. usage -------- ``abstar`` can be used both as a command-line tool and as a Python API. The command-line interface provides a straightforward way to process files containing antibody sequences, while the Python API allows for integration of ``abstar``'s annotation capabilities into custom analysis pipelines. For large datasets, ``abstar`` offers distributed processing capabilities to accelerate annotation of many sequences in parallel. Detailed usage instructions are available in the CLI and API documentation sections. file formats --------------- **Input Formats**: ``abstar`` accepts antibody sequences in FASTA and FASTQ formats, which are standard formats for storing nucleotide sequences. These can be raw sequencing output files (for example, paired-end reads from Illumina or Element sequencing platforms) or pre-processed sequence data. **Output Formats**: by default, ``abstar`` generates annotation results in AIRR-compliant TSV format. We also offer the ability to generate output in Parquet format, which is a columnar storage format that is more space-efficient for large datasets and can be faster for certain types of analysis. In either case (TSV or Parquet), all output adheres to the standardized AIRR schema, ensuring interoperability with other tools in the immunoinformatics ecosystem. germline databases ------------------- ``abstar`` comes pre-packaged with built-in germline databases for human, macaque, and mouse. The human and mouse databases are based on the `Open Germline Receptor Database (OGRDB) `_ germline reference sets. ``abstar`` also supports the use and creation of custom germline databases, which can be used to annotate sequences from species that are not included in the built-in databases or to use donor-specific databases created using tools like `IgDiscover `_ or `Digger `_. .. toctree:: :maxdepth: 1 :hidden: :caption: getting started installation .. toctree:: :maxdepth: 2 :hidden: :caption: usage cli api .. toctree:: :maxdepth: 1 :hidden: :caption: about license .. toctree:: :maxdepth: 1 :hidden: :caption: related projects abutils scab index ----- * :ref:`genindex` * :ref:`modindex`