Input sequences can be provided in several different formats:
- individual sequences as positional arguments:
run(seq1, seq2, temp=temp, output=output)
- a list of sequences, as an argument:
run([seq1, seq2], temp=temp, output=output)
- a single FASTA/Q-formatted input file, passed via
- a directory of FASTA/Q-formatted files, passed via
When passing sequences (not FASTA/Q files), the sequences can be in any format recognized by
- a raw nucleotide sequence, as a string (a random sequence ID will be assigned)
- a list/tuple of the format
- a BioPython SeqRecord object
- an abtools Sequence object
Supplying a single input sequence in list/tuple format is not supported, as abstar assumes that each element of an iterable is a separate sequence if an iterable is the only argument. Either convert the list/tuple to a Sequence object before calling abstar.run() or supply a nested list containing the sequence information, for example: [[sequence_id, sequence], ].
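The nesting workaround described above can be sketched as follows. The helper name `as_single_input` is hypothetical and not part of abstar; it only builds the nested list structure.

```python
def as_single_input(record):
    """Wrap one (sequence_id, sequence) pair in an outer list.

    Hypothetical helper, not part of abstar: it only builds the nested
    [[sequence_id, sequence]] structure described above.
    """
    if isinstance(record, str):
        # A two-character string would otherwise unpack silently below.
        raise TypeError("expected a (sequence_id, sequence) pair, not a raw string")
    sequence_id, sequence = record  # raises ValueError unless exactly two items
    return [[sequence_id, sequence]]


nested = as_single_input(('seq1', 'ATGC'))
print(nested)  # [['seq1', 'ATGC']]
```

The nested list could then be supplied as the only argument, so that the outer list is read as a collection holding one sequence.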
project_dir, or all of
If processing a single sequence, you can pass the raw sequence, as a string:
import abstar
result = abstar.run('ATGC')
or as a
sequence = Sequence('ATGC', id='seq1')
result = abstar.run(sequence)
If you pass just the raw sequence, a random sequence ID will be generated with uuid.uuid4(). In either case, when given a single sequence, abstar.run() will return a single Sequence object.
If running multiple sequences, you can either pass each sequence as a positional argument:
result_list = abstar.run(['seq1', 'ATGC'], ['seq2', 'CGTA'])
or you can pass a list of sequences as the first argument, in this case using sequences parsed from a FASTA file using Biopython:
from Bio import SeqIO
# Parse the FASTA file into a list of SeqRecord objects; the file is closed on exit
with open('my_sequences.fasta', 'r') as fasta:
    seqs = list(SeqIO.parse(fasta, 'fasta'))
result_list = abstar.run(seqs)
When given multiple sequences, abstar.run() will return a list of abtools Sequence objects, one per input sequence.
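To make the argument handling above concrete, here is a minimal sketch of how positional arguments, a single list, and raw strings might be normalized into (ID, sequence) pairs. The function `normalize_inputs` is hypothetical and only mirrors the behavior described in the text; abstar's internal handling may differ.

```python
import uuid


def normalize_inputs(*args):
    """Return a list of (sequence_id, sequence) pairs.

    Illustrative only; this is not abstar's implementation.
    """
    # A lone list/tuple argument is treated as a collection of sequences.
    if len(args) == 1 and isinstance(args[0], (list, tuple)):
        items = list(args[0])
    else:
        items = list(args)

    pairs = []
    for item in items:
        if isinstance(item, str):
            # Raw sequence: assign a random ID with uuid.uuid4().
            pairs.append((str(uuid.uuid4()), item))
        else:
            sequence_id, sequence = item
            pairs.append((sequence_id, sequence))
    return pairs


# Two positional [id, sequence] pairs are read as two sequences.
print(normalize_inputs(['seq1', 'ATGC'], ['seq2', 'CGTA']))

# A single flat pair is read as two raw sequences, 'seq1' and 'ATGC'.
print(len(normalize_inputs(['seq1', 'ATGC'])))  # 2
```

The second call shows why a single sequence in list/tuple form is ambiguous: under this reading, each element of the only iterable is taken as its own sequence.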
If you’d prefer not to parse the FASTQ/A file into a list (for example, if the input file is extremely large), you can pass the input file path directly, along with a temp directory and output directory:
result_files = abstar.run(input='/path/to/my_sequences.fasta', temp='/path/to/temp', output='/path/to/output')
Given a file path, abstar.run() returns a list of output file paths. In the above case, result_files will be a list containing a single output file path:
If you have a directory containing multiple FASTQ/A files, you can pass the directory path using
result_files = abstar.run(input='/path/to/input', temp='/path/to/temp', output='/path/to/output')
result_files will contain a list of output file paths.
If your input directory contains paired FASTQ files (gzip compressed or uncompressed) that need to be merged prior to processing with abstar:
result_files = abstar.run(input='/path/to/input', temp='/path/to/temp', output='/path/to/output', merge=True)
The paired read files in input will be merged with PANDAseq prior to processing with abstar. By default, PANDAseq’s ‘simple bayesian’ read merging algorithm is used, although alternate algorithms can be selected with
abstar provides several output format options. By default, abstar will produce a JSON-formatted output file. abstar’s output format options include:
- Tab-delimited format compatible with the Adaptive Immune Receptor Repertoires Community’s (AIRR-C)
schema guidelines. This format contains all required fields, several “optional” fields, and
several fields that are not part of the schema but conform to the naming conventions of existing schema
fields (examples include
- Comma-delimited format containing a subset of the fields contained in the default JSON output format. This format was originally conceived for extremely large datasets, for which output size and compatibility with tabular databases (such as MySQL and Apache Spark) were high priorities.
- Comma-delimited format that mimics the IMGT Summary file. This output option is provided to minimize the effort needed to convert existing IMGT-based pipelines to abstar.
Multiple output formats can be produced in a single run of abstar, although this is only available when passing an input file or directory; passing individual sequences or a list of sequences (which returns Sequence objects) can only return a single output format. To produce AIRR output:
result_files = abstar.run(input='/path/to/input', temp='/path/to/temp', output='/path/to/output', output_type='airr')
To produce both JSON and AIRR-formatted outputs:
result_files = abstar.run(input='/path/to/input', temp='/path/to/temp', output='/path/to/output', output_type=['json', 'airr'])
In interactive mode (providing Sequence objects rather than an input file or directory), returning AIRR-formatted data can be accomplished by:
results = abstar.run(sequences, output_type='airr')
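Because the AIRR output is tab-delimited, it can be read with Python's standard `csv` module. The sketch below parses a small in-memory stand-in rather than a real abstar output file. The column names are standard AIRR rearrangement fields, but the values are made up and the exact set of columns abstar writes is an assumption here.

```python
import csv
import io

# Tiny stand-in for an AIRR-formatted output file; the values are made up.
sample = (
    "sequence_id\tv_call\tj_call\tproductive\n"
    "seq1\tIGHV1-2*02\tIGHJ4*02\tT\n"
    "seq2\tIGHV3-23*01\tIGHJ6*02\tF\n"
)


def read_airr(handle):
    """Yield one dict per record from a tab-delimited AIRR file handle."""
    yield from csv.DictReader(handle, delimiter='\t')


records = list(read_airr(io.StringIO(sample)))
productive = [r['sequence_id'] for r in records if r['productive'] == 'T']
print(productive)  # ['seq1']
```

For an uncompressed output file on disk, the same function could be given a handle opened with `open(path, newline='')`.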
- project_dir (str) – Path to the project directory. Most useful when downloading files directly from BaseSpace; all subdirectories will be created by AbStar.
- input (str) – Path to input directory, containing FASTA/Q files. If performing read merging with PANDAseq, paired FASTQ files may be gzip compressed.
- output (str) – Path to output directory.
- temp (str) – Path to temp directory, where intermediate job files will be stored.
- log (str) – Path to log file. If not provided and project_dir is provided, the log will be written to /path/to/project_dir/abstar.log. If output is provided, log will be written to
- germ_db (str) – Germline database to be used. Choices are ‘human’, ‘macaque’, ‘mouse’, ‘humouse’, and ‘rabbit’. The ‘humouse’ database contains all germline genes from the human and mouse databases, and is designed to process data from humanized mouse models expressing one or more human germline genes as well as mouse germline genes. Default is ‘human’.
- isotype (bool) – If True, the isotype will be inferred by aligning the sequence region downstream of the J-gene. If False, the isotype will not be determined. Default is True.
- uid (int) – Length (in nucleotides) of the Unique Molecular ID used to barcode input RNA. A positive integer results in the UMID being parsed from the start of the read (or merged read), a negative integer results in parsing from the end of the read. Default is 0, which results in no UMID parsing.
- gzip (bool) – If True, compresses output files with gzip. Default is False.
- pretty (bool) – If True, formats JSON output files to be more human-readable. If False, JSON output files contain one record per line. Default is False.
- output_type (str) – Options are ‘json’, ‘airr’, ‘tabular’, or ‘imgt’. JSON output is the most detailed. Default is ‘json’.
- merge (bool) – If True, input must be paired-read FASTQ files (gzip compressed or uncompressed),
which will be merged with PANDAseq prior to processing with AbStar. If
merge is automatically set to True. Default is False.
- pandaseq_algo (str) – Define merging algorithm to be used by PANDAseq. Options are ‘simple_bayesian’, ‘ea_util’, ‘flash’, ‘pear’, ‘rdp_mle’, ‘stitch’, or ‘uparse’. Default is ‘simple_bayesian’, which is the default PANDAseq algorithm.
- debug (bool) – If
abstar.run() runs in single-threaded mode, the log is much more verbose, and temporary files are not removed. Default is
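The sign convention of the uid parameter described above (positive parses from the start of the read, negative from the end, zero disables parsing) can be illustrated with plain string slicing. The function `split_umid` is a hypothetical helper, and returning the read with the UMID removed is an assumption made for clarity, not a statement about what abstar does with the remainder.

```python
def split_umid(read, uid=0):
    """Split a read into (umid, remainder) following the uid sign convention.

    Illustrative only: whether abstar trims the UMID from the read is an
    assumption made here.
    """
    if uid > 0:
        # Positive length: UMID is taken from the start of the read.
        return read[:uid], read[uid:]
    if uid < 0:
        # Negative length: UMID is taken from the end of the read.
        return read[uid:], read[:uid]
    return '', read  # uid == 0: no UMID parsing


print(split_umid('AACCGGTTAC', 4))   # ('AACC', 'GGTTAC')
print(split_umid('AACCGGTTAC', -4))  # ('TTAC', 'AACCGG')
print(split_umid('AACCGGTTAC'))      # ('', 'AACCGGTTAC')
```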
If the input is a single sequence, run() returns a single abtools
If the input is a list of sequences, run() returns a list of abtools
If the input is a file or a directory of files, run() returns a list of output files.
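When run() returns output file paths and the defaults are used (JSON output, pretty and gzip both False), each file holds one JSON record per line. A minimal sketch of reading such a file is shown below. It writes its own small stand-in file, and the field name `seq_id` is a placeholder rather than a documented abstar field.

```python
import json
import os
import tempfile


def iter_json_records(path):
    """Yield one dict per non-empty line of a line-delimited JSON file."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Stand-in for one entry of result_files, written to a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.json')
    with open(path, 'w') as handle:
        handle.write('{"seq_id": "seq1"}\n{"seq_id": "seq2"}\n')
    ids = [rec['seq_id'] for rec in iter_json_records(path)]

print(ids)  # ['seq1', 'seq2']
```

Reading lazily, one record at a time, keeps memory use flat for very large output files.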