.. _umis: UMI Support =========== abstar can detect and extract unique molecular identifiers (UMIs) from sequencing data. UMIs are short random sequences added during library preparation that enable error correction and duplicate identification. Basic Usage ----------- **Command Line:** .. code-block:: bash # Using a built-in pattern abstar run sequences.fasta output/ --umi_pattern smartseq-human-bcr # Custom pattern with specified length abstar run sequences.fasta output/ --umi_pattern "[UMI]TCAGCGGGAAGACATT" --umi_length 12 # UMI by position only (no pattern matching) abstar run sequences.fasta output/ --umi_length 12 **Python:** .. code-block:: python import abstar # Using built-in pattern sequences = abstar.run( "sequences.fasta", umi_pattern="smartseq-human-bcr" ) # Custom pattern sequences = abstar.run( "sequences.fasta", umi_pattern="[UMI]TCAGCGGGAAGACATT", umi_length=12 ) Pattern Format -------------- UMI patterns use ``[UMI]`` as a placeholder to indicate where the UMI sequence is located, with conserved flanking sequences for anchoring: **Pattern examples:** ``"[UMI]TCAGCGGGAAGACATT"`` UMI at 5' end, followed by conserved sequence ``TCAGCGGGAAGACATT`` ``"ATGCATGC[UMI]"`` Conserved sequence ``ATGCATGC`` followed by UMI ``"ATGC[UMI]GCTA"`` UMI flanked by conserved sequences on both sides When the pattern has a trailing conserved sequence (like ``[UMI]TCAG...``), the UMI length can be inferred from the alignment. When the pattern ends with ``[UMI]`` (no trailing sequence), ``--umi_length`` is required. UMI Position (5' vs 3') ----------------------- The sign of ``--umi_length`` determines which end of the sequence to search: - **Positive length**: UMI is at the 5' end of the sequence - **Negative length**: UMI is at the 3' end of the sequence .. code-block:: bash # 12bp UMI at 5' end abstar run seqs.fasta out/ --umi_length 12 # 8bp UMI at 3' end (sequence is reverse-complemented before matching) abstar run seqs.fasta out/ --umi_length -8 When a negative length is used, the sequence is automatically reverse-complemented before pattern matching, so you can write patterns in the 5'->3' orientation of your primers. Built-in Patterns ----------------- smartseq-human-bcr ~~~~~~~~~~~~~~~~~~ For `Takara's Smart-Seq Human BCR kit`_ with UMIs. This pattern set includes primers for heavy chain (IgG, IgM, IgA, IgD, IgE) and light chains (kappa, lambda), each with a 12bp UMI. .. code-block:: bash abstar run sequences.fasta output/ --umi_pattern smartseq-human-bcr The patterns automatically handle mixed samples containing heavy, kappa, and lambda chains by attempting to match each chain-specific primer pattern. .. _Takara's Smart-Seq Human BCR kit: https://www.takarabio.com/products/next-generation-sequencing/immune-profiling/human-repertoire/smart-seq-human-bcr-with-umis Multiple UMIs ------------- If a sequence contains multiple UMIs (e.g., at both ends, or with different primers), they are concatenated with ``+``: .. code-block:: text # Single UMI umi: "ATCGATCGATCG" # Multiple UMIs umi: "ATCGATCGATCG+GCTAGCTAGCTA" Mismatch Tolerance ------------------ By default, up to 1 mismatch is allowed when matching the conserved flanking sequences. The ``smartseq-human-bcr`` pattern allows 2 mismatches. Output ------ UMIs are stored in the ``umi`` field of the annotation output: **In Python:** .. code-block:: python sequences = abstar.run("input.fasta", umi_pattern="smartseq-human-bcr") for seq in sequences: print(seq["umi"]) # e.g., "ATCGATCGATCG" **In AIRR TSV output:** The UMI appears in the ``umi`` column. **In Parquet output:** The UMI is stored in the ``umi`` field. Sequences Without UMIs ---------------------- Sequences where the UMI pattern cannot be matched (due to too many mismatches or the pattern not being found) will have a ``null``/empty UMI field but are still annotated normally.