core

abstar.core.abstar.run(*args, **kwargs)

Runs abstar.

Input sequences can be provided in several different formats:

  1. individual sequences as positional arguments: run(seq1, seq2, temp=temp, output=output)
  2. a list of sequences, as an argument: run([seq1, seq2], temp=temp, output=output)
  3. a single FASTA/Q-formatted input file, passed via input
  4. a directory of FASTA/Q-formatted files, passed via input

When passing sequences (not FASTA/Q files), the sequences can be in any format recognized by abtools.sequence.Sequence, including:

  • a raw nucleotide sequence, as a string (a random sequence ID will be assigned)
  • a list/tuple of the format [sequence_id, sequence]
  • a BioPython SeqRecord object
  • an abtools Sequence object

Caution

Supplying a single input sequence in list/tuple format is not supported, as abstar assumes that each element of an iterable is a separate sequence if an iterable is the only argument. Either convert the list/tuple to a Sequence object before calling abstar.run() or supply a nested list containing the sequence information, for example: [[sequence_id, sequence], ].
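
The nesting workaround described above can be sketched as follows. The helper wrap_single_sequence is hypothetical (it is not part of abstar), and it assumes that any two-string list/tuple is a [sequence_id, sequence] pair rather than two raw sequences:

```python
def wrap_single_sequence(seq):
    """Hypothetical helper, not part of abstar.

    Nests a single [sequence_id, sequence] pair inside a list so that the
    outer list contains exactly one element (one sequence).
    Assumption: a two-string list/tuple is an ID/sequence pair.
    """
    is_pair = (
        isinstance(seq, (list, tuple))
        and len(seq) == 2
        and all(isinstance(item, str) for item in seq)
    )
    return [list(seq)] if is_pair else seq


pair = ['seq1', 'ATGC']
nested = wrap_single_sequence(pair)

# `pair` has two elements, so passed alone it would be read as two sequences;
# `nested` has one element: the [sequence_id, sequence] pair.
print(len(pair), len(nested))
```

The nested value is what would then be passed as the only argument to abstar.run().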

Either sequences, project_dir, or all of input, output and temp are required.
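
As a rough illustration of this rule, the check below returns True when one of the three accepted combinations is present. The function has_required_inputs is hypothetical and only mirrors the sentence above; abstar's own validation may differ:

```python
def has_required_inputs(sequences=None, project_dir=None,
                        input=None, output=None, temp=None):
    """Illustrative check only, not abstar's actual validation.

    The parameter names mirror abstar.run(); `input` intentionally
    shadows the Python builtin inside this function.
    """
    if sequences:
        return True
    if project_dir is not None:
        return True
    # Without sequences or project_dir, all three paths are needed.
    return all(path is not None for path in (input, output, temp))


print(has_required_inputs(sequences=['ATGC']))
print(has_required_inputs(input='/path/to/input', output='/path/to/output'))
```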

Examples

If processing a single sequence, you can pass the raw sequence, as a string:

import abstar

result = abstar.run('ATGC')

or as a Sequence object:

from abtools.sequence import Sequence

sequence = Sequence('ATGC', id='seq1')

result = abstar.run(sequence)

If you pass just the raw sequence, a random sequence ID will be generated with uuid.uuid4(). In either case, when given a single sequence, abstar.run() will return a single Sequence object.

If running multiple sequences, you can either pass each sequence as a positional argument:

result_list = abstar.run(['seq1', 'ATGC'], ['seq2', 'CGTA'])

or you can pass a list of sequences as the first argument, in this case using sequences parsed from a FASTA file using Biopython:

from Bio import SeqIO

# Use a context manager so the file handle is closed after parsing
with open('my_sequences.fasta', 'r') as fasta:
    seqs = list(SeqIO.parse(fasta, 'fasta'))
result_list = abstar.run(seqs)

When given multiple sequences, abstar.run() will return a list of abtools Sequence objects, one per input sequence.

If you’d prefer not to parse the FASTA/Q file into a list (for example, if the input file is extremely large), you can pass the input file path directly, along with a temp directory and an output directory:

result_files = abstar.run(input='/path/to/my_sequences.fasta',
                          temp='/path/to/temp',
                          output='/path/to/output')

Given a file path, abstar.run() returns a list of output file paths. In the above case, result_files will be a list containing a single output file path: /path/to/output/json/my_sequences.json.
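
The path in this example can be reproduced with a short sketch. The helper expected_json_output is hypothetical and is inferred only from the single example above (output directory, then a json subdirectory, then the input file's base name with a .json extension); it does not account for compressed inputs or other output formats:

```python
import posixpath


def expected_json_output(input_file, output_dir):
    """Hypothetical helper inferred from the example above, not abstar's code."""
    # Strip the directory and the final extension from the input file name.
    base = posixpath.splitext(posixpath.basename(input_file))[0]
    return posixpath.join(output_dir, 'json', base + '.json')


print(expected_json_output('/path/to/my_sequences.fasta', '/path/to/output'))
# prints /path/to/output/json/my_sequences.json
```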

If you have a directory containing multiple FASTA/Q files, you can pass the directory path using input:

result_files = abstar.run(input='/path/to/input',
                          temp='/path/to/temp',
                          output='/path/to/output')

As before, result_files will contain a list of output file paths.
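
With the default settings (JSON output, with pretty left as False), each output file contains one JSON record per line, so the returned paths can be read line by line. The sketch below assumes uncompressed output (gzip left as False); the reader function and the field name in the stand-in file are placeholders for illustration, not abstar's actual output schema:

```python
import json
import os
import tempfile


def read_json_records(path):
    """Yield one dict per non-empty line of a line-delimited JSON file."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in output file so the sketch runs without abstar installed.
# The "seq_id" field is a placeholder, not a documented abstar field.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'my_sequences.json')
    with open(demo_path, 'w') as handle:
        handle.write('{"seq_id": "seq1"}\n{"seq_id": "seq2"}\n')
    records = list(read_json_records(demo_path))

print(len(records))
```

In practice, read_json_records would be called on each path in result_files.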

If your input directory contains paired FASTQ files (gzip compressed or uncompressed) that need to be merged prior to processing with abstar:

result_files = abstar.run(input='/path/to/input',
                          temp='/path/to/temp',
                          output='/path/to/output',
                          merge=True)

The paired read files in input will be merged with PANDAseq prior to processing with abstar. By default, PANDAseq’s ‘simple bayesian’ read merging algorithm is used, although alternate algorithms can be selected with pandaseq_algo.

abstar provides several output format options. By default, abstar produces JSON-formatted output files. The available output formats are:

  • json – abstar’s default format, JavaScript Object Notation (JSON). This format is the most comprehensive. JSON’s nesting and inclusion of programmatic objects (such as lists) make this format extremely flexible and well-suited to adaptive immune receptor sequence data, particularly for cases in which additional fields may be added in the future (such as clonality-related annotations).
  • airr – Tab-delimited format compatible with the Adaptive Immune Receptor Repertoires Community’s (AIRR-C) schema guidelines. This format contains all required fields, several “optional” fields, and several fields that are not part of the schema but conform to the naming conventions of existing schema fields (examples include v_mutations and v_mutations_aa).
  • tabular – Comma-delimited format containing a subset of the fields contained in the default JSON output format. This format was originally conceived for extremely large datasets, for which output size and compatibility with tabular databases (such as MySQL and Apache Spark) were high priorities.
  • imgt – Comma-delimited format that mimics the IMGT Summary file. This output option is provided to minimize the effort needed to convert existing IMGT-based pipelines to abstar.

Multiple output formats can be produced in a single run of abstar, although this is only available when passing an input file or directory; passing individual sequences or a list of sequences (which returns Sequence objects) can only return a single output format. To produce AIRR output:

result_files = abstar.run(input='/path/to/input',
                          temp='/path/to/temp',
                          output='/path/to/output',
                          output_type='airr')

To produce both JSON and AIRR-formatted outputs:

result_files = abstar.run(input='/path/to/input',
                          temp='/path/to/temp',
                          output='/path/to/output',
                          output_type=['json', 'airr'])
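
Because the AIRR output is tab-delimited, it can be read with Python's csv module. The sketch below uses an in-memory stand-in instead of a real abstar output file; sequence_id and v_call are AIRR schema fields and v_mutations is mentioned above, but the row values are placeholders and read_airr is a hypothetical helper:

```python
import csv
import io


def read_airr(handle):
    """Read tab-delimited AIRR rows into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter='\t'))


# In-memory stand-in for an AIRR output file; the values are placeholders.
demo = io.StringIO(
    'sequence_id\tv_call\tv_mutations\n'
    'seq1\tplaceholder_v_gene\t\n'
)
rows = read_airr(demo)

print(rows[0]['sequence_id'])
```

For a real file, the handle would come from open() on one of the paths in result_files.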

In interactive mode (providing Sequence objects rather than an input file or directory), AIRR-formatted data can be returned with:

results = abstar.run(sequences, output_type='airr')

Parameters:
  • project_dir (str) – Path to the project directory. Most useful when downloading files directly from BaseSpace. All subdirectories will be created by abstar.
  • input (str) – Path to an input FASTA/Q file, or to a directory containing FASTA/Q files. If performing read merging with PANDAseq, paired FASTQ files may be gzip compressed.
  • output (str) – Path to output directory.
  • temp (str) – Path to temp directory, where intermediate job files will be stored.
  • log (str) – Path to log file. If not provided and project_dir is provided, the log will be written to /path/to/project_dir/abstar.log. If output is provided, the log will be written to /path/to/output/abstar.log.
  • germ_db (str) – Germline database to be used. Choices are ‘human’, ‘macaque’, ‘mouse’, ‘humouse’, and ‘rabbit’. The ‘humouse’ database contains all germline genes from the human and mouse databases, and is designed to process data from humanized mouse models expressing one or more human germline genes as well as mouse germline genes. Default is ‘human’.
  • isotype (bool) – If True, the isotype will be inferred by aligning the sequence region downstream of the J-gene. If False, the isotype will not be determined. Default is True.
  • uid (int) – Length (in nucleotides) of the Unique Molecular ID used to barcode input RNA. A positive integer results in the UMID being parsed from the start of the read (or merged read), a negative integer results in parsing from the end of the read. Default is 0, which results in no UMID parsing.
  • gzip (bool) – If True, compresses output files with gzip. Default is False.
  • pretty (bool) – If True, formats JSON output files to be more human-readable. If False, JSON output files contain one record per line. Default is False.
  • output_type (str or list of str) – Options are ‘json’, ‘airr’, ‘tabular’, or ‘imgt’. A list of formats may be passed when input is a file or directory. JSON output is the most detailed. Default is ‘json’.
  • merge (bool) – If True, input must be paired-read FASTQ files (gzip compressed or uncompressed), which will be merged with PANDAseq prior to processing with abstar. If basespace is True, merge is automatically set to True. Default is False.
  • pandaseq_algo (str) – Define merging algorithm to be used by PANDAseq. Options are ‘simple_bayesian’, ‘ea_util’, ‘flash’, ‘pear’, ‘rdp_mle’, ‘stitch’, or ‘uparse’. Default is ‘simple_bayesian’, which is the default PANDAseq algorithm.
  • debug (bool) – If True, abstar.run() runs in single-threaded mode, the log is much more verbose, and temporary files are not removed. Default is False.
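
The sign convention of the uid parameter above can be illustrated with a short sketch. The function split_umid is hypothetical and is not abstar's implementation; in particular, it assumes the UMID is removed from the read after parsing, which the parameter description does not state:

```python
def split_umid(read, uid=0):
    """Illustrative only: split a read into (umid, remainder).

    Positive uid: UMID is taken from the start of the read.
    Negative uid: UMID is taken from the end of the read.
    Zero: no UMID parsing.
    Assumption: the UMID is removed from the returned remainder.
    """
    if uid > 0:
        return read[:uid], read[uid:]
    if uid < 0:
        return read[uid:], read[:uid]
    return '', read


print(split_umid('AACCGGTT', 2))    # prints ('AA', 'CCGGTT')
print(split_umid('AACCGGTT', -2))   # prints ('TT', 'AACCGG')
```
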
Returns:

If the input is a single sequence, run() returns a single abtools Sequence object.

If the input is a list of sequences, run() returns a list of abtools Sequence objects.

If the input is a file or a directory of files, run() returns a list of output files.
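
Because the return type depends on the input, calling code that may receive either a single result or a list can normalize it first. A minimal sketch, using a hypothetical helper as_list and plain strings as stand-ins for abtools Sequence objects:

```python
def as_list(result):
    """Return `result` unchanged if it is a list, otherwise wrap it in one."""
    return result if isinstance(result, list) else [result]


# Strings stand in for the objects returned by abstar.run().
single = as_list('annotated_seq1')
several = as_list(['annotated_seq1', 'annotated_seq2'])

print(len(single), len(several))
```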