API Examples¶
abstar and abutils both expose a public API containing many of the core functions. This makes it reasonably straightforward to build custom pipelines that include several abstar/abutils components or integrate these tools with third-party tools. A few simple examples are shown below.
Case #1¶
Sequencing data consists of an Illumina MiSeq run on human samples, with the raw data stored in BaseSpace (project ID: 123456789). Samples are indexed, so each sample will be downloaded from BaseSpace as a separate pair of read files. We’d like to do several things:
- get a FASTQC report on the raw data
- remove adapters
- quality trim
- get another FASTQC report on the cleaned data
- merge paired reads
- annotate with abstar
import os
import abstar
from abstar.utils import basespace, pandaseq
PROJECT_DIR = '/path/to/project'
PROJECT_ID = '123456789'
# download data from BaseSpace
bs_dir = os.path.join(PROJECT_DIR, 'raw_data')
basespace.download(bs_dir, project_id=PROJECT_ID)
# FASTQC on the raw data
fastqc1_dir = os.path.join(PROJECT_DIR, 'fastqc-pre')
abstar.fastqc(bs_dir, output=fastqc1_dir)
# adapter trimming
adapter_dir = os.path.join(PROJECT_DIR, 'adapter_trimed')
adapters = '/path/to/adapters.fasta'
abstar.adapter_trim(bs_dir, output=adapter_dir, adapter_both=adapters)
# quality trimming
quality_dir = os.path.join(PROJECT_DIR, 'quality_trimed')
abstar.quality_trim(adapter_dir, output=quality_dir)
# FASTQC on the cleaned data
fastqc2_dir = os.path.join(PROJECT_DIR, 'fastqc-post')
abstar.fastqc(quality_dir, output=fastqc2_dir)
# read merging
merged_dir = os.path.join(PROJECT_DIR, 'merged')
pandaseq.run(quality_dir, merged_dir)
# run abstar
temp_dir = os.path.join(PROJECT_DIR, 'temp')
json_dir = os.path.join(PROJECT_DIR, 'json')
abstar.run(input=merged_dir,
temp=temp_dir,
output=json_dir)
Case #2¶
Sequencing data is a directory of single-read FASTQ files that have already been quality/adapter trimmed. We’d like to do the following:
- get a FASTQC report
- annotate with abstar
- import the JSONs into a MongoDB database named
MyDatabase
Our FASTQ file names are formatted as: SampleNumber-SampleName.fastq
, which means the abstar output
file name would be SampleNumber-SampleName.json
. We’d like the corresponding MongoDB collection
to just be named SampleName
.
import os
import abstar
from abstar.utils import mongoimport
PROJECT_DIR = '/path/to/project'
FASTQ_DIR = '/path/to/fastqs'
MONGO_IP = '123.45.67.89'
MONGO_PORT = 27017
MONGO_USER = 'MyUsername'
MONGO_PASS = 'Secr3t'
# FASTQC on the input data
fastqc_dir = os.path.join(PROJECT_DIR, 'fastqc')
abstar.fastqc(FASTQ_DIR, output=fastqc_dir)
# run abstar
temp_dir = os.path.join(PROJECT_DIR, 'temp')
json_dir = os.path.join(PROJECT_DIR, 'json')
abstar.run(input=FASTQ_DIR,
temp=temp_dir,
output=json_dir)
# import into MongoDB
mongoimport.run(ip=MONGO_IP,
port=MONGO_PORT
user=MONGO_USER,
password=MONGO_PASS,
input=json_dir,
db='MyDatabase'
delim1='-',
delim2='.')
Case #3¶
Now we’d like to use abstar as part of an analysis script in which sequence annotation
isn’t the primary output. In the previous
examples, we started with raw(ish) sequence data and ended with either a directory of
JSON files or a MongoDB database populated with abstar output. In this case, we’re
going to start with a MongoDB database, query that database for some sequences, and
generate the unmutated common ancestor (UCA). We’d like to annotate the UCA sequence
inline (as part of the script) so that we can do world-changing things with the
annotated UCA later in our script. For simplicity’s sake, we’re querying a local MongoDB
database that doesn’t have authentication enabled, although abutils.utils.mongodb
can
work with remote MongoDB servers that require authentication.
import abstar
from abutils.utils import mongodb
from abutils.utils.sequence import Sequence
DB_NAME = 'MyDatabase'
COLLECTION_NAME = 'MyCollection'
def get_sequences(db_name, collection_name):
db = mongodb.get_db(db_name)
c = db[collection]
seqs = c.find({'chain': 'heavy'})
return [Sequence(s) for s in seqs]
def calculate_uca(sequences):
#
# code to calculate the UCA sequence, as a string
#
return uca
# get sequences, calculate the UCA
sequences = get_sequences(DB_NAME, COLLECTION_NAME)
uca_seq = calculate_uca(sequences)
# run abstar on the UCA, returns an abutils Sequence object
uca = abstar.run(['UCA', uca_seq])
# do amazing, world-changing things with the UCA
# ...
# ...
# ...