Output Formats

abstar outputs annotations in AIRR-compatible format. Two file formats are supported:

  • airr: Tab-delimited TSV file with header row

  • parquet: Columnar binary format, more space-efficient for large datasets

Specifying Output Format

Command Line:

# Default: AIRR TSV
abstar run sequences.fasta output/

# Parquet format
abstar run sequences.fasta output/ -o parquet

# Both formats
abstar run sequences.fasta output/ -o airr -o parquet

Python:

import abstar

# Write to files
abstar.run("sequences.fasta", "output/", output_format=["airr", "parquet"])

# Return as DataFrame (no file output)
df = abstar.run("sequences.fasta", as_dataframe=True)

Output Directory Structure

output/
├── airr/                 # AIRR TSV files
│   └── sequences.tsv
├── parquet/              # Parquet files
│   └── sequences.parquet
└── logs/                 # Log files
    └── abstar.log

Output Fields

Core Identification

Field

Type

Description

sequence_id

String

Unique sequence identifier

sequence_input

String

Original input sequence

sequence_oriented

String

Sequence in V->J orientation

rev_comp

Boolean

True if sequence was reverse-complemented

quality

String

Quality scores (if FASTQ input)

umi

String

Unique molecular identifier (if parsed)

locus

String

Locus (e.g., IGH, IGK, IGL, TRA, TRB)

species

String

Species from germline database

germline_database

String

Name of germline database used

Gene Calls

Field

Type

Description

v_call

String

V gene assignment with allele (e.g., IGHV1-2*02)

d_call

String

D gene assignment

j_call

String

J gene assignment

c_call

String

C gene (isotype) assignment

v_gene

String

V gene without allele (e.g., IGHV1-2)

d_gene

String

D gene without allele

j_gene

String

J gene without allele

c_gene

String

C gene without allele

v_support

Float

V gene assignment E-value

d_support

Float

D gene assignment E-value

j_support

Float

J gene assignment E-value

c_support

Float

C gene assignment E-value

Regions

Field

Type

Description

fwr1, fwr1_aa

String

Framework region 1 (nucleotide, amino acid)

cdr1, cdr1_aa

String

CDR1

fwr2, fwr2_aa

String

Framework region 2

cdr2, cdr2_aa

String

CDR2

fwr3, fwr3_aa

String

Framework region 3

cdr3, cdr3_aa

String

CDR3

fwr4, fwr4_aa

String

Framework region 4

junction, junction_aa

String

Junction region (conserved C to conserved W/F)

cdr3_length

Integer

CDR3 length in amino acids

np1, np2

String

N-nucleotide regions (non-templated)

np1_length, np2_length

Integer

Length of N-regions

Junction Components

Field

Type

Description

cdr3_v, cdr3_v_aa

String

V gene contribution to CDR3

cdr3_n1, cdr3_n1_aa

String

N1 region (V-D junction)

cdr3_d, cdr3_d_aa

String

D gene contribution to CDR3

cdr3_n2, cdr3_n2_aa

String

N2 region (D-J junction)

cdr3_j, cdr3_j_aa

String

J gene contribution to CDR3

Quality Metrics

Field

Type

Description

productive

Boolean

True if sequence is productive

productivity_issues

String

List of productivity issues (if any)

stop_codon

Boolean

True if stop codon present

complete_vdj

Boolean

True if V, D (heavy only), and J assigned

v_identity

Float

V gene identity (0-1)

v_identity_aa

Float

V gene amino acid identity

j_identity

Float

J gene identity

frame

Integer

Reading frame (0, 1, or 2)

Mutations

Field

Type

Description

v_mutations

String

V gene mutations (format: “pos:ref>alt”)

v_mutations_aa

String

V gene amino acid mutations

v_mutation_count

Integer

Number of V gene mutations

v_mutation_count_aa

Integer

Number of V gene AA mutations

v_insertions

String

Non-templated insertions in V

v_deletions

String

Non-templated deletions in V

v_frameshift

Boolean

True if frameshift in V region

c_mutations

String

C gene mutations

c_mutation_count

Integer

Number of C gene mutations

Sequences

Field

Type

Description

sequence

String

V(D)J sequence (no leader, no constant)

germline

String

Corresponding germline sequence

sequence_aa

String

Amino acid sequence

germline_aa

String

Germline amino acid sequence

sequence_gapped

String

IMGT-gapped sequence

germline_gapped

String

IMGT-gapped germline

sequence_alignment

String

Aligned sequence (with gaps from alignment)

germline_alignment

String

Aligned germline

Masks

Field

Type

Description

cdr_mask

String

CDR region mask (0=FWR, 1=CDR1, 2=CDR2, 3=CDR3)

gene_segment_mask

String

Gene segment mask (V, D, J, C)

nongermline_mask

String

Mutation position mask

Position Coordinates

Field

Type

Description

v_sequence_start

Integer

V region start in input sequence

v_sequence_end

Integer

V region end in input sequence

v_germline_start

Integer

Start position in V germline

v_germline_end

Integer

End position in V germline

j_sequence_start

Integer

J region start in input sequence

j_sequence_end

Integer

J region end in input sequence