UMI Support¶
abstar can detect and extract unique molecular identifiers (UMIs) from sequencing data. UMIs are short random sequences added during library preparation that enable error correction and duplicate identification.
Basic Usage¶
Command Line:
# Using a built-in pattern
abstar run sequences.fasta output/ --umi_pattern smartseq-human-bcr
# Custom pattern with specified length
abstar run sequences.fasta output/ --umi_pattern "[UMI]TCAGCGGGAAGACATT" --umi_length 12
# UMI by position only (no pattern matching)
abstar run sequences.fasta output/ --umi_length 12
Python:
import abstar
# Using built-in pattern
sequences = abstar.run(
    "sequences.fasta",
    umi_pattern="smartseq-human-bcr",
)
# Custom pattern
sequences = abstar.run(
    "sequences.fasta",
    umi_pattern="[UMI]TCAGCGGGAAGACATT",
    umi_length=12,
)
Pattern Format¶
UMI patterns use [UMI] as a placeholder to indicate where the UMI
sequence is located, with conserved flanking sequences for anchoring:
Pattern examples:
"[UMI]TCAGCGGGAAGACATT": UMI at 5’ end, followed by conserved sequence TCAGCGGGAAGACATT
"ATGCATGC[UMI]": Conserved sequence ATGCATGC followed by UMI
"ATGC[UMI]GCTA": UMI flanked by conserved sequences on both sides
When the pattern has a trailing conserved sequence (like [UMI]TCAG...),
the UMI length can be inferred from the alignment. When the pattern ends
with [UMI] (no trailing sequence), --umi_length is required.
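As a rough illustration of how a [UMI] placeholder pattern can be interpreted, the sketch below splits a pattern into its leading and trailing anchors and locates the UMI by exact string matching. It is not abstar's implementation: the names split_pattern and extract_umi are hypothetical, mismatch tolerance is left out, and umi_length is only consulted when the pattern ends with [UMI].

```python
def split_pattern(pattern):
    """Split a pattern such as 'ATGC[UMI]GCTA' into (leading, trailing) anchors."""
    leading, placeholder, trailing = pattern.partition("[UMI]")
    if not placeholder:
        raise ValueError("pattern must contain the [UMI] placeholder")
    return leading, trailing


def extract_umi(sequence, pattern, umi_length=None):
    """Return the UMI found by exact anchor matching, or None if no match.

    Hypothetical helper for illustration only (exact matching, no mismatches).
    """
    leading, trailing = split_pattern(pattern)

    # The UMI starts right after the leading anchor, or at position 0.
    if leading:
        index = sequence.find(leading)
        if index == -1:
            return None
        start = index + len(leading)
    else:
        start = 0

    if trailing:
        # A trailing anchor fixes the end of the UMI, so its length is inferred.
        end = sequence.find(trailing, start)
        if end == -1:
            return None
    else:
        # No trailing anchor: the length has to be supplied by the caller.
        if umi_length is None:
            raise ValueError("umi_length is required when the pattern ends with [UMI]")
        end = start + umi_length
        if end > len(sequence):
            return None

    return sequence[start:end]
```

For example, extract_umi("ATGCAAAACCCCGCTA", "ATGC[UMI]GCTA") locates the eight bases between the two anchors without being told the length.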
UMI Position (5’ vs 3’)¶
The sign of --umi_length determines which end of the sequence to search:
Positive length: UMI is at the 5’ end of the sequence
Negative length: UMI is at the 3’ end of the sequence
# 12bp UMI at 5' end
abstar run seqs.fasta out/ --umi_length 12
# 8bp UMI at 3' end (sequence is reverse-complemented before matching)
abstar run seqs.fasta out/ --umi_length -8
When a negative length is used, the sequence is automatically reverse-complemented before pattern matching, so you can write patterns in the 5’->3’ orientation of your primers.
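The effect of the sign can be sketched in a few lines of Python. This is an assumption-laden illustration, not abstar's code: reverse_complement and umi_by_position are hypothetical helpers, and the sketch assumes that a 3’ UMI is reported in the reverse-complemented orientation.

```python
# Translation table for complementing DNA bases (N is left as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def umi_by_position(sequence, umi_length):
    """Take the UMI by position only (hypothetical helper).

    Positive length: the first N bases of the sequence.
    Negative length: the first |N| bases of the reverse complement,
    which correspond to the last |N| bases of the original sequence.
    """
    if umi_length == 0:
        raise ValueError("umi_length must be non-zero")
    if umi_length < 0:
        sequence = reverse_complement(sequence)
    return sequence[:abs(umi_length)]
```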
Built-in Patterns¶
smartseq-human-bcr¶
For Takara’s Smart-Seq Human BCR kit with UMIs. This pattern set includes primers for heavy chains (IgG, IgM, IgA, IgD, IgE) and light chains (kappa, lambda), each with a 12 bp UMI.
abstar run sequences.fasta output/ --umi_pattern smartseq-human-bcr
The patterns automatically handle mixed samples containing heavy, kappa, and lambda chains by attempting to match each chain-specific primer pattern.
Multiple UMIs¶
If a sequence contains multiple UMIs (e.g., at both ends, or with different
primers), they are concatenated with +:
# Single UMI
umi: "ATCGATCGATCG"
# Multiple UMIs
umi: "ATCGATCGATCG+GCTAGCTAGCTA"
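Downstream code that needs the individual UMIs can split the field on the + separator. A minimal sketch, where split_umis is a hypothetical helper and not part of abstar:

```python
def split_umis(umi):
    """Split a '+'-joined UMI field into a list of individual UMIs.

    Returns an empty list when the field is None or empty.
    """
    if not umi:
        return []
    return umi.split("+")
```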
Mismatch Tolerance¶
By default, up to 1 mismatch is allowed when matching the conserved flanking
sequences. The smartseq-human-bcr pattern allows 2 mismatches.
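To make the idea of a mismatch budget concrete, the sketch below scans a read for the first window that lies within a given Hamming distance of the conserved sequence. It assumes that mismatches are substitutions only; how abstar actually scores matches (for example, whether gaps are permitted) is not specified here, and hamming and find_anchor are hypothetical names.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def find_anchor(sequence, anchor, max_mismatches=1):
    """Return the start of the first window within max_mismatches of anchor.

    Returns -1 when no window qualifies. Substitutions only (assumption).
    """
    width = len(anchor)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if hamming(window, anchor) <= max_mismatches:
            return start
    return -1
```

With a budget of 1, a read carrying a single substitution in TCAGCGGGAAGACATT still matches; with a budget of 0 the same read would be rejected.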
Output¶
UMIs are stored in the umi field of the annotation output:
In Python:
sequences = abstar.run("input.fasta", umi_pattern="smartseq-human-bcr")
for seq in sequences:
    print(seq["umi"])  # e.g., "ATCGATCGATCG"
In AIRR TSV output:
The UMI appears in the umi column.
In Parquet output:
The UMI is stored in the umi field.
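One common use of the umi column is counting how many reads share each UMI. The sketch below does this for a tab-separated file using only the standard library. The count_umis helper is hypothetical and the inline rows are made-up illustration data; only the umi column name comes from the description above, and sequence_id is assumed to be present as in standard AIRR output.

```python
import csv
import io
from collections import Counter


def count_umis(tsv_handle):
    """Count reads per UMI in a tab-separated file with a 'umi' column."""
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    counts = Counter()
    for row in reader:
        umi = row.get("umi") or ""
        if umi:  # rows without a UMI are skipped
            counts[umi] += 1
    return counts


# Made-up rows standing in for an AIRR TSV file.
example = io.StringIO(
    "sequence_id\tumi\n"
    "seq1\tATCGATCGATCG\n"
    "seq2\tATCGATCGATCG\n"
    "seq3\t\n"
    "seq4\tATCGATCGATCG+GCTAGCTAGCTA\n"
)
counts = count_umis(example)
for umi, n in counts.most_common():
    print(umi, n)
```

For a real file, pass an open file handle (opened with newline="") in place of the io.StringIO object.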
Sequences Without UMIs¶
Sequences where the UMI pattern cannot be matched (due to too many mismatches
or the pattern not being found) will have a null/empty UMI field but are
still annotated normally.
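If UMI-less sequences need separate handling downstream, they can be split off after annotation. The sketch below uses plain dictionaries as stand-ins for annotated sequences, assuming only that each record exposes a "umi" key as in the Output examples; partition_by_umi is a hypothetical helper.

```python
def partition_by_umi(records):
    """Split records into (with_umi, without_umi) based on the 'umi' field.

    A record counts as UMI-less when its 'umi' value is None or empty.
    """
    with_umi, without_umi = [], []
    for record in records:
        if record["umi"]:
            with_umi.append(record)
        else:
            without_umi.append(record)
    return with_umi, without_umi


# Stand-in records; real annotated sequences carry many more fields.
records = [
    {"sequence_id": "seq1", "umi": "ATCGATCGATCG"},
    {"sequence_id": "seq2", "umi": None},
    {"sequence_id": "seq3", "umi": ""},
]
with_umi, without_umi = partition_by_umi(records)
```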