Output Formats
abstar outputs annotations in AIRR-compatible format. Two file formats are supported:
- airr: Tab-delimited TSV file with header row
- parquet: Columnar binary format, more space-efficient for large datasets
Specifying Output Format
Command Line:
# Default: AIRR TSV
abstar run sequences.fasta output/
# Parquet format
abstar run sequences.fasta output/ -o parquet
# Both formats
abstar run sequences.fasta output/ -o airr -o parquet
Python:
import abstar
# Write to files
abstar.run("sequences.fasta", "output/", output_format=["airr", "parquet"])
# Return as DataFrame (no file output)
df = abstar.run("sequences.fasta", as_dataframe=True)
Output Directory Structure
output/
├── airr/              # AIRR TSV files
│   └── sequences.tsv
├── parquet/           # Parquet files
│   └── sequences.parquet
└── logs/              # Log files
    └── abstar.log
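
Because the AIRR output is a tab-delimited file with a header row, any tab-aware reader can load it. The following is a minimal sketch using only the Python standard library. It writes a small stand-in file so it runs without abstar installed; the helper name `read_airr_tsv` is hypothetical, and the column names (`sequence_id`, `v_call`, `locus`) are borrowed from the AIRR Rearrangement schema as an assumption about what abstar writes.

```python
import csv
import tempfile
from pathlib import Path


def read_airr_tsv(path):
    """Read a tab-delimited file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for output/airr/sequences.tsv. The column names and values are
# illustrative assumptions, not real abstar output.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "sequences.tsv"
    tsv_path.write_text(
        "sequence_id\tv_call\tlocus\n"
        "seq1\tIGHV1-2*02\tIGH\n"
        "seq2\tIGKV3-20*01\tIGK\n"
    )
    rows = read_airr_tsv(tsv_path)

# Keep only heavy-chain records.
heavy_ids = [row["sequence_id"] for row in rows if row["locus"] == "IGH"]
print(heavy_ids)
```

Every value arrives as a string with this reader, so numeric and Boolean columns would need explicit conversion.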
Output Fields
Core Identification
| Field | Type | Description |
|---|---|---|
| | String | Unique sequence identifier |
| | String | Original input sequence |
| | String | Sequence in V->J orientation |
| | Boolean | True if sequence was reverse-complemented |
| | String | Quality scores (if FASTQ input) |
| | String | Unique molecular identifier (if parsed) |
| | String | Locus (e.g., IGH, IGK, IGL, TRA, TRB) |
| | String | Species from germline database |
| | String | Name of germline database used |
Gene Calls
| Field | Type | Description |
|---|---|---|
| | String | V gene assignment with allele (e.g., IGHV1-2*02) |
| | String | D gene assignment |
| | String | J gene assignment |
| | String | C gene (isotype) assignment |
| | String | V gene without allele (e.g., IGHV1-2) |
| | String | D gene without allele |
| | String | J gene without allele |
| | String | C gene without allele |
| | Float | V gene assignment E-value |
| | Float | D gene assignment E-value |
| | Float | J gene assignment E-value |
| | Float | C gene assignment E-value |
Regions
| Field | Type | Description |
|---|---|---|
| | String | Framework region 1 (nucleotide, amino acid) |
| | String | CDR1 |
| | String | Framework region 2 |
| | String | CDR2 |
| | String | Framework region 3 |
| | String | CDR3 |
| | String | Framework region 4 |
| | String | Junction region (conserved C to conserved W/F) |
| | Integer | CDR3 length in amino acids |
| | String | N-nucleotide regions (non-templated) |
| | Integer | Length of N-regions |
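
The junction runs from the conserved C to the conserved W/F, so under the IMGT convention it is the CDR3 plus one anchor residue on each side. The sketch below shows that relationship for an amino acid junction. The function name `cdr3_from_junction` and the example sequence are hypothetical, and it assumes abstar's CDR3 fields follow the same IMGT definition.

```python
def cdr3_from_junction(junction_aa):
    """Strip the conserved anchor residues from an amino acid junction.

    Assumes the IMGT convention: the junction starts at the conserved C and
    ends at the conserved W/F, and the CDR3 is everything in between.
    """
    if len(junction_aa) < 2:
        raise ValueError("junction must contain both anchor residues")
    if junction_aa[0] != "C" or junction_aa[-1] not in "WF":
        raise ValueError("junction does not start with C and end with W/F")
    return junction_aa[1:-1]


junction = "CARDYGSSYFDYW"  # made-up example junction
cdr3 = cdr3_from_junction(junction)
print(cdr3, len(cdr3))
```

Raising on non-canonical anchors is a design choice for this sketch; real repertoires contain sequences with unusual anchors, which a pipeline may prefer to flag instead of reject.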
Junction Components
| Field | Type | Description |
|---|---|---|
| | String | V gene contribution to CDR3 |
| | String | N1 region (V-D junction) |
| | String | D gene contribution to CDR3 |
| | String | N2 region (D-J junction) |
| | String | J gene contribution to CDR3 |
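
Read in 5' to 3' order, the five components above describe the CDR3 as V, N1, D, N2, J. The following sketch joins them back together. It assumes the components tile the CDR3 without overlap and that absent components (for example D and N2 in a light chain) appear as empty or missing values; both points are assumptions, and `assemble_cdr3` is a hypothetical helper.

```python
def assemble_cdr3(v_part, n1, d_part, n2, j_part):
    """Concatenate junction components in 5'->3' order, skipping missing ones."""
    parts = [v_part, n1, d_part, n2, j_part]
    return "".join(part for part in parts if part)


# Made-up heavy-chain components.
heavy = assemble_cdr3("GCGAGA", "GG", "TACGGTAGT", "C", "TACTTTGACTAC")
# A light chain has no D segment; here the D and N2 slots are passed as None.
light = assemble_cdr3("CAGCAGTAT", "C", None, None, "CCGACG")
print(len(heavy), light)
```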
Quality Metrics
| Field | Type | Description |
|---|---|---|
| | Boolean | True if sequence is productive |
| | String | List of productivity issues (if any) |
| | Boolean | True if stop codon present |
| | Boolean | True if V, D (heavy only), and J assigned |
| | Float | V gene identity (0-1) |
| | Float | V gene amino acid identity |
| | Float | J gene identity |
| | Integer | Reading frame (0, 1, or 2) |
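
Identity values are fractions between 0 and 1. As an illustration of what such a number means, the sketch below computes the fraction of matching columns between an aligned query and its aligned germline. The function `identity` is hypothetical, and skipping gap columns is an assumption; abstar may handle gaps and alignment ends differently.

```python
def identity(aligned_query, aligned_germline, gap="-"):
    """Fraction of aligned columns where query and germline match.

    Columns with a gap in either sequence are skipped; this gap handling is
    an assumption and may differ from how abstar computes identity.
    """
    if len(aligned_query) != len(aligned_germline):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for q, g in zip(aligned_query, aligned_germline):
        if q == gap or g == gap:
            continue
        compared += 1
        if q == g:
            matches += 1
    return matches / compared if compared else 0.0


# One mismatch in ten compared columns.
v_identity = identity("ACGTACGTAC", "ACGTACGAAC")
print(v_identity)
```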
Mutations
| Field | Type | Description |
|---|---|---|
| | String | V gene mutations (format: “pos:ref>alt”) |
| | String | V gene amino acid mutations |
| | Integer | Number of V gene mutations |
| | Integer | Number of V gene AA mutations |
| | String | Non-templated insertions in V |
| | String | Non-templated deletions in V |
| | Boolean | True if frameshift in V region |
| | String | C gene mutations |
| | Integer | Number of C gene mutations |
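
Mutation strings use the "pos:ref>alt" format, which is straightforward to split into structured records. The sketch below parses one such field. The `Mutation` type and `parse_mutations` function are hypothetical, and the `|` separator between entries is an assumption, since the table does not say how multiple mutations are delimited.

```python
from typing import NamedTuple


class Mutation(NamedTuple):
    position: int
    ref: str
    alt: str


def parse_mutations(field, sep="|"):
    """Parse a string of 'pos:ref>alt' entries into Mutation tuples.

    The separator between entries is an assumption; adjust `sep` to match
    the actual output.
    """
    mutations = []
    for entry in field.split(sep):
        entry = entry.strip()
        if not entry:
            continue
        position, change = entry.split(":", 1)
        ref, alt = change.split(">", 1)
        mutations.append(Mutation(int(position), ref, alt))
    return mutations


muts = parse_mutations("12:A>G|45:C>T")
print(len(muts), muts[0])
```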
Sequences
| Field | Type | Description |
|---|---|---|
| | String | V(D)J sequence (no leader, no constant) |
| | String | Corresponding germline sequence |
| | String | Amino acid sequence |
| | String | Germline amino acid sequence |
| | String | IMGT-gapped sequence |
| | String | IMGT-gapped germline |
| | String | Aligned sequence (with gaps from alignment) |
| | String | Aligned germline |
Masks
| Field | Type | Description |
|---|---|---|
| | String | CDR region mask (0=FWR, 1=CDR1, 2=CDR2, 3=CDR3) |
| | String | Gene segment mask (V, D, J, C) |
| | String | Mutation position mask |
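
A CDR mask encoded as 0=FWR, 1=CDR1, 2=CDR2, 3=CDR3 can be used to pull the CDRs out of a sequence. The sketch below assumes the mask has exactly one character per position of the sequence it annotates; which sequence that is (nucleotide or amino acid) is not stated here, so the example uses made-up data. `cdrs_from_mask` is a hypothetical helper.

```python
from itertools import groupby

REGION_LABELS = {"1": "cdr1", "2": "cdr2", "3": "cdr3"}


def cdrs_from_mask(sequence, mask):
    """Extract CDR1-3 from a sequence using a per-position mask string.

    Assumes one mask character per sequence position, with 0=FWR and
    1/2/3 marking CDR1/CDR2/CDR3.
    """
    if len(sequence) != len(mask):
        raise ValueError("sequence and mask must have equal length")
    regions = {}
    position = 0
    # Walk the mask as runs of identical characters.
    for code, run in groupby(mask):
        length = len(list(run))
        if code in REGION_LABELS:
            regions[REGION_LABELS[code]] = sequence[position:position + length]
        position += length
    return regions


# Made-up sequence and mask, built piecewise so the lengths line up.
seq = "QVQLS" + "GYTFTS" + "WVRQ" + "INPSG" + "RFTC" + "ARD" + "WG"
mask = "0" * 5 + "1" * 6 + "0" * 4 + "2" * 5 + "0" * 4 + "3" * 3 + "0" * 2
cdrs = cdrs_from_mask(seq, mask)
print(cdrs)
```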
Position Coordinates
| Field | Type | Description |
|---|---|---|
| | Integer | V region start in input sequence |
| | Integer | V region end in input sequence |
| | Integer | Start position in V germline |
| | Integer | End position in V germline |
| | Integer | J region start in input sequence |
| | Integer | J region end in input sequence |
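
Start and end coordinates can be used to slice a region out of the input sequence, but the result depends on the indexing convention. The AIRR Rearrangement schema uses 1-based closed intervals; whether abstar follows that convention is an assumption worth checking against real output. The sketch below, with a hypothetical `slice_region` helper, handles both 1-based and 0-based inclusive coordinates.

```python
def slice_region(sequence, start, end, one_based=True):
    """Return the region covered by a closed (inclusive) start/end interval.

    With one_based=True the coordinates are treated as 1-based closed
    intervals (the AIRR convention); otherwise as 0-based closed intervals.
    """
    if one_based:
        start -= 1  # convert the 1-based start to a Python slice index
    else:
        end += 1  # make the inclusive 0-based end exclusive for slicing
    return sequence[start:end]


read = "NNNACGTACGTTT"  # made-up input sequence
v_region = slice_region(read, 4, 11)
print(v_region)
```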