germline databasesΒΆ
abstar comes pre-packaged with built-in germline databases for human, macaque, and the C57/bl6 and BALB/c mouse strains.
The default germline database is human, but a different germline database can be specified with
--germline_database:
abstar run --germline_database balbc path/to/sequences.fasta path/to/output/
Note
The C57/bl6 database is named c57bl6, and the BALB/c database is named balbc. Both use
all lowercase letters and omit the slash (/) character.
abstar can also create custom germline databases, either for a species that is not included in
the set of built-in germline databases or for donor-specific databases created using tools like
IgDiscover or Digger.
The build_germline_database command can be used to create germline databases using one of two
different types of input files (or even a mix of the two):
FASTA-formatted files containing IMGT-gapped germline gene sequences
JSON-formatted files containing germline gene sequences in AIRR format, such as those from OGRDB.
FASTA-formatted files can be supplied using --fasta (or -f), which can be used
multiple times to specify multiple files. JSON-formatted files can be supplied using --json
(or -j), which can also be used multiple times to specify multiple files. The files can contain
a mix of V, D or J gene sequences. Constant region sequences can be supplied as FASTA-formatted file(s)
using --constants (or -c), which can also be used multiple times to specify multiple files.
An example command for creating a database named my_germline_db might look like this:
abstar build_germline_database my_germline_db -f germlines.fasta -j more_germlines.json -c constants.fasta
Germline database location
By default, germline databases are deposited in ~/.abstar/germline_dbs/. This can be changed
using the -l (or --location) option, which can be used to specify a alternative location.
abstar will only look for custom germline databases in ~/.abstar/germline_dbs/, so
the option to specify a custom location is provided primarily for testing purposes.
Warning
When running abstar, databases in ~/.abstar/germline_dbs/ will have priority over
the built-in databases. This means that if a custom database named human exists in
~/.abstar/germline_dbs/, it will be used instead of the built-in human database.
Receptor type
The -r (or --receptor) option can be used to specify the receptor type for the germline database.
The default receptor type is bcr, but tcr can also be specified.
Manifest files
An optional manifest file can be supplied using the --manifest (or -m) option. A manifest file
is a text file (of any format) that contains supplementary information about the germline database. For example,
the the manifest file could contain information about the source of the germline sequences, the download
date of the germline sequences, or any other relevant information. An example using the -m option might look
like this:
abstar build_germline_database my_germline_db -f germlines.fasta -m manifest.txt
Species names
The --include_species_in_name option can be used to include the species name in the
name of each sequence in the germline database. This option is provided primarily to simplify the creation of multi-species databases that
may result in duplicate germline gene names. This is useful when analyzing data from, for example, transgenic
mouse models that contain one or more human sequences in addition to the mouse sequences.
Note
The --include_species_in_name option is only applicable when using JSON-formatted files as input.
The resulting
germline database will have unique sequence names like so: IGHV1-2*02__homo_sapiens. When processing
data with a multi-species database, abstar will automatically remove the species when populating the
germline call fields, and the species name will be included in the species field. For example, IGHV1-2*02__homo_sapiens
will be truncated to IGHV1-2*02 when populating the v_call field, and the species field will be populated
with homo_sapiens. For example:
abstar build_germline_database my_germline_db -j human.json -j mouse.json --include_species_in_name