Python API
abstar provides a Python API for integrating sequence annotation into custom analysis pipelines.
The abstar.run() Function
The main entry point for annotation:
import abstar
# Basic usage - returns annotated Sequence objects
sequences = abstar.run("sequences.fasta")
# Return as polars DataFrame
df = abstar.run("sequences.fasta", as_dataframe=True)
# Write to project directory
abstar.run("sequences.fasta", "project/", output_format=["airr", "parquet"])
# TCR annotation
sequences = abstar.run("tcr.fasta", receptor="tcr")
# Mouse sequences with the built-in mouse germline database
sequences = abstar.run("sequences.fasta", germline_database="mouse")
Parameters
sequences: Input sequences. Can be:
    - Path to a FASTA/FASTQ file
    - Path to a directory of FASTA/FASTQ files
    - A single abutils.Sequence object
    - An iterable of Sequence objects
project_path (optional): Directory for output files. If provided, results are written to disk and the function returns None. If not provided, annotated sequences are returned.
germline_database: Germline database name. Default: "human". Built-in options: human, mouse, macaque, humouse
receptor: Receptor type: "bcr" (default) or "tcr"
output_format: Output format(s): "airr" (TSV), "parquet", or a list of both. Default: "airr"
as_dataframe: If True, return a polars DataFrame instead of Sequence objects. Default: False
umi_pattern: Pattern for UMI extraction. See UMI Support for details.
umi_length: UMI length. Positive for 5’ end, negative for 3’ end.
merge: Merge paired-end FASTQ files before annotation. Default: False
merge_kwargs: Additional arguments for the merge function as a dict.
chunksize: Sequences per annotation batch. Default: 500
mmseqs_chunksize: Sequences per MMseqs2 batch. Default: 1000000
mmseqs_threads: Threads for MMseqs2. Default: auto-detected
n_processes: Parallel annotation workers. Default: CPU count
verbose: Print progress information. Default: False
debug: Retain temp files and enable detailed logging. Default: False
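The sign convention for umi_length can be pictured as plain string slicing. The sketch below is only an illustration of that convention: extract_umi is a hypothetical helper, not part of abstar, and abstar's own UMI handling may differ in detail (see UMI Support).

```python
def extract_umi(sequence: str, umi_length: int) -> str:
    """Hypothetical helper (not an abstar function): slice a UMI by signed length.

    A positive length takes bases from the 5' end (start of the read);
    a negative length takes bases from the 3' end (end of the read).
    """
    if umi_length == 0:
        return ""
    if umi_length > 0:
        return sequence[:umi_length]
    return sequence[umi_length:]


read = "ACGTACGTACGTTTTTGGGGCCCC"
print(extract_umi(read, 12))   # first 12 bases of the read
print(extract_umi(read, -4))   # last 4 bases of the read
```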
Return Types
When project_path is None (default):
Returns annotated abutils.Sequence objects:
sequences = abstar.run("input.fasta")
for seq in sequences:
    print(seq.id)
    print(seq["v_call"])      # V gene assignment
    print(seq["cdr3_aa"])     # CDR3 amino acid sequence
    print(seq["productive"])  # Productivity status
When as_dataframe=True:
Returns a polars DataFrame:
import polars as pl
df = abstar.run("input.fasta", as_dataframe=True)
# Filter productive sequences
productive = df.filter(pl.col("productive") == True)
# Group by V gene
v_gene_counts = df.group_by("v_gene").len()
When project_path is provided:
Returns None; writes files to project directory:
abstar.run("input.fasta", "project/")
# Output files:
# project/airr/input.tsv
# project/logs/abstar.log
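Because nothing is returned when project_path is given, downstream code has to read the written files back. A minimal sketch using only the standard library, assuming the AIRR output is a tab-separated file with a header row; read_airr_tsv is a hypothetical helper, and the small file written here is a stand-in for project/airr/input.tsv so the example runs without abstar:

```python
import csv
import tempfile
from pathlib import Path


def read_airr_tsv(path):
    """Hypothetical helper: read a tab-separated file into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for project/airr/input.tsv (columns and values are illustrative).
with tempfile.TemporaryDirectory() as tmp:
    airr_file = Path(tmp) / "airr" / "input.tsv"
    airr_file.parent.mkdir(parents=True)
    airr_file.write_text(
        "sequence_id\tv_call\tcdr3_aa\n"
        "seq1\tIGHV1-2*02\tARDYW\n"
        "seq2\tIGHV3-23*01\tAKGGW\n"
    )
    rows = read_airr_tsv(airr_file)

print([row["v_call"] for row in rows])
```

All values come back as strings with this approach, so boolean or numeric columns need explicit conversion.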
Module Namespaces
abstar.gl - Germline Functions
Access germline sequences and database paths.
Get database path:
import abstar
# Get path to built-in human BCR database
path = abstar.gl.get_germline_database_path("human", receptor="bcr")
# Get path to custom database
path = abstar.gl.get_germline_database_path("my_custom_db")
Get germline sequences:
# Get a specific allele
vgene = abstar.gl.get_germline("IGHV1-2*02", "human")
print(vgene.sequence)
# Get all alleles of a gene (returns list)
alleles = abstar.gl.get_germline("IGHV1-2", "human")
# Get IMGT-gapped sequence
vgene_gapped = abstar.gl.get_germline("IGHV1-2*02", "human", imgt_gapped=True)
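Since an allele name yields a single object while a gene name yields a list, calling code may want to treat both cases uniformly. The helpers below are hypothetical (they are not part of abstar.gl) and rely only on the IMGT convention that an allele name carries a "*" suffix; plain strings stand in for germline objects so the sketch runs on its own.

```python
def is_allele_name(name: str) -> bool:
    """Hypothetical helper: IMGT allele names contain '*', e.g. IGHV1-2*02."""
    return "*" in name


def as_allele_list(result):
    """Hypothetical helper: normalize a single object or a list into a list."""
    if result is None:
        return []
    if isinstance(result, (list, tuple)):
        return list(result)
    return [result]


# Strings stand in for the objects that get_germline would return.
single = as_allele_list("IGHV1-2*02")
several = as_allele_list(["IGHV1-2*01", "IGHV1-2*02"])
print(is_allele_name("IGHV1-2*02"), is_allele_name("IGHV1-2"))
print(len(single), len(several))
```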
abstar.pp - Preprocessing
Paired-end read merging.
import abstar
# Merge paired FASTQ files in a directory
merged_files = abstar.pp.merge_fastqs(
    "fastq_directory/",
    "merged_output/",
    schema="illumina",  # or "element"
)
# With quality trimming options
merged_files = abstar.pp.merge_fastqs(
    "fastq_directory/",
    "merged_output/",
    minimum_overlap=30,
    quality_cutoff=20,
    trim_adapters=True,
)
Parameters:
schema: Filename schema ("illumina" or "element")
minimum_overlap: Minimum overlap for merging (default: 30)
allowed_mismatches: Allowed mismatches in overlap (default: 5)
trim_adapters: Trim adapters (default: True)
quality_trim: Quality trim (default: True)
quality_cutoff: Quality threshold (default: 20)
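To make minimum_overlap and allowed_mismatches concrete, here is a toy overlap merger. It is a conceptual sketch only, not abstar's merging implementation: merge_pair and reverse_complement are hypothetical names, base qualities are ignored, and the longest acceptable overlap is chosen.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(forward, reverse, minimum_overlap=30, allowed_mismatches=5):
    """Toy merger: join two reads at the longest acceptable overlap.

    The reverse read is reverse-complemented, then the end of the forward
    read is compared with the start of that sequence. Returns the merged
    string, or None if no overlap meets both thresholds.
    """
    rc = reverse_complement(reverse)
    max_overlap = min(len(forward), len(rc))
    for overlap in range(max_overlap, max(minimum_overlap, 1) - 1, -1):
        tail = forward[-overlap:]
        head = rc[:overlap]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= allowed_mismatches:
            return forward + rc[overlap:]
    return None


# A 20-base fragment sequenced from both ends with an 8-base overlap.
fragment = "ACGTTGCAAGGCTTAACCGG"
forward_read = fragment[:14]
reverse_read = reverse_complement(fragment[6:])
merged = merge_pair(forward_read, reverse_read, minimum_overlap=6, allowed_mismatches=0)
print(merged == fragment)
```

Raising minimum_overlap above the true overlap (8 here) makes the toy merger return None, which mirrors why short overlaps fail to merge under stricter settings.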
abstar.tl - Tools
Utility functions for database building and UMI parsing.
Build custom germline database:
import abstar
abstar.tl.build_germline_database(
    name="my_database",
    fastas=["v_genes.fasta", "d_genes.fasta", "j_genes.fasta"],
    constants=["c_genes.fasta"],
    receptor="bcr",
)
Parse UMIs from sequences:
# Parse UMIs and return annotated sequences
umi = abstar.tl.parse_umis(
    "sequence_string_or_file",
    pattern="[UMI]TCAGCGGGAAGACATT",
    length=12,
)
See UMI Support for detailed UMI documentation.
Examples
Basic annotation pipeline:
import abstar
# Annotate sequences
sequences = abstar.run("sequences.fasta")
# Filter productive sequences
productive = [s for s in sequences if s["productive"]]
# Extract CDR3 sequences
cdr3_sequences = [s["cdr3_aa"] for s in productive]
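The same filtering pattern extends naturally to simple summaries such as a CDR3 length distribution. In this sketch, plain dicts stand in for annotated Sequence objects (both support the seq["field"] access shown above), and the values are made up for illustration:

```python
from collections import Counter

# Illustrative stand-ins for annotated sequences; values are invented.
annotated = [
    {"productive": True, "cdr3_aa": "ARDYW"},
    {"productive": True, "cdr3_aa": "AKGGYFDYW"},
    {"productive": False, "cdr3_aa": "AR*YW"},
    {"productive": True, "cdr3_aa": "ARGSW"},
]

# Keep productive sequences, then tally CDR3 amino acid lengths.
productive = [s for s in annotated if s["productive"]]
cdr3_lengths = Counter(len(s["cdr3_aa"]) for s in productive)

for length, count in sorted(cdr3_lengths.items()):
    print(length, count)
```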
Large-scale processing:
import abstar
# Process with multiple output formats
abstar.run(
    "large_dataset/",
    "output/",
    output_format=["airr", "parquet"],
    n_processes=16,
    mmseqs_threads=8,
)
DataFrame analysis:
import abstar
import polars as pl
df = abstar.run("sequences.fasta", as_dataframe=True)
# Analyze V gene usage
v_usage = (
    df.filter(pl.col("productive") == True)
    .group_by("v_gene")
    .len()
    .sort("len", descending=True)
)
print(v_usage)