Helper utilities

abstar.utils.basespace

abstar.utils.basespace.download(download_directory, project_id=None, project_name=None)

Downloads sequencing data from BaseSpace (Illumina’s cloud storage platform).

Before accessing BaseSpace through the AbStar API, you need to set up a credentials file:

  1. Obtain a BaseSpace access token. The easiest way to do this is to set up a BaseSpace developer account, following the instructions in Illumina's BaseSpace developer documentation.

  2. Make a BaseSpace credentials file using your developer credentials:

    $ make_basespace_credfile

and follow the prompts.

Examples

If you know the name of the project you’d like to download:

from abstar.utils import basespace

basespace.download('/path/to/download_directory', project_name='MyProject')

If you know the ID of the project you’d like to download:

basespace.download('/path/to/download_directory', project_id='ABC123')

If neither project_id nor project_name is provided, a list of your available BaseSpace projects will be displayed and you can select the project to download from that list:

basespace.download('/path/to/download_directory')
Parameters:
  • download_directory (str) – Directory into which the raw sequence files should be downloaded. If the directory does not exist, it will be created.
  • project_id (str) – ID of the project to be downloaded.
  • project_name (str) – Name of the project to be downloaded.
Returns:

The number of sequence files downloaded.

Return type:

int

abstar.utils.mongoimport

abstar.utils.mongoimport.run(**kwargs)

Imports one or more JSON files into a MongoDB database.

Examples

To import a single JSON file into MyDatabase on a local MongoDB database:

from abstar.utils import mongoimport

mongoimport.run(input='/path/to/MySequences.json', db='MyDatabase')

This will create a collection named ‘MySequences.json’ in MyDatabase on your local MongoDB instance (if it doesn’t already exist) and import the data from MySequences.json into that collection.

Doing the same thing, but with a remote MongoDB server running on port 27017:

mongoimport.run(ip='123.45.67.89',
                user='my_username',
                password='Secr3t',
                input='/path/to/MySequences.json',
                db='MyDatabase')

But what if we want the collection name to be different from the file name? We can truncate the filename at the first occurrence of any given pattern with delim1:

mongoimport.run(input='/path/to/MySequences.json',
                db='MyDatabase',
                delim1='.')

In this case, the collection name is created by truncating the input file name at the first occurrence of ., so the collection name would be MySequences. We can also truncate the filename at the Nth occurrence of any given pattern by using delim1 with split1_pos:

mongoimport.run(input='/path/to/my_sequences_2016-01-01.json',
                db='MyDatabase',
                delim1='_',
                split1_pos=2)

which results in a collection name of my_sequences.

If we have more complex filenames, we can use delim1 in combination with delim2. When delim1 and delim2 are used together, delim1 becomes the pattern used to cut the filename on the left and delim2 is used to cut the filename on the right. For example, if our filename is plate-2_SampleName-01_redo.json and we want the collection to be named SampleName, we would set delim1 to _ and delim2 to -. We also need to specify that we want to cut at the second occurrence of delim2, which we can do with split2_pos:

mongoimport.run(input='/path/to/plate-2_SampleName-01_redo.json',
                db='MyDatabase',
                delim1='_',
                delim2='-',
                split2_pos=2)

Trimming filenames this way is nice, but it becomes much more useful if you’re importing more than one file at a time. mongoimport.run() will accept a list of file names, and will generate separate collection names for each input file:

files = ['/path/to/A01-Sample01_2016-01-01.json',
         '/path/to/A02-Sample02_2016-01-01.json',
         '/path/to/A03-Sample03_2016-01-01.json']

mongoimport.run(input=files,
                db='MyDatabase',
                delim1='-',
                delim2='_')

The three input files will be imported into collections Sample01, Sample02 and Sample03, respectively. Finally, you can pass the path to a directory containing one or more JSON files, and all of the JSON files in that directory will be imported:

mongoimport.run(input='/path/to/output/directory',
                db='MyDatabase',
                delim1='-',
                delim2='_')
Parameters:
  • input (str, list) –

    Input is required and may be one of three things:

    1. A list/tuple of JSON file paths
    2. A path to a single JSON file
    3. A path to a directory containing one or more JSON files.
  • ip (str) – The IP address of the MongoDB server. Default is ‘localhost’.
  • port (int) – MongoDB port. Default is 27017.
  • user (str) – Username with which to connect to the MongoDB database. If either of user or password is not provided, mongoimport.run() will attempt to connect to the MongoDB database without authentication.
  • password (str) – Password with which to connect to the MongoDB database. If either of user or password is not provided, mongoimport.run() will attempt to connect to the MongoDB database without authentication.
  • db (str) – Name of the MongoDB database for import. Required.
  • log (str) – Path to a logfile. If not provided, log information will be written to stdout.
  • delim1 (str) – Pattern on which to split the input file to generate the collection name. Default is None, which results in the file name being used as the collection name.
  • split1_pos (int) – Occurrence of delim1 on which to split the input file name. Default is 1.
  • delim2 (str) – Second pattern on which to split the input file name to generate the collection name. Default is None, which results in only delim1 being used.
  • split2_pos (int) – Occurrence of delim2 on which to split the input file name. Default is 1.
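The filename-splitting rules above can be sketched as a small standalone function. This is an illustrative reimplementation of the naming behavior described in this section, not mongoimport's actual code; the function name collection_name is hypothetical:

```python
import os

def collection_name(path, delim1=None, split1_pos=1, delim2=None, split2_pos=1):
    """Illustrative sketch (not mongoimport's actual code) of how a
    collection name is derived from an input file name."""
    name = os.path.basename(path)
    if delim1 is None:
        # no delimiters: the file name itself is the collection name
        return name
    if delim2 is None:
        # keep everything before the split1_pos-th occurrence of delim1
        return delim1.join(name.split(delim1)[:split1_pos])
    # delim2 cuts on the right: keep everything before its split2_pos-th occurrence
    right_trimmed = delim2.join(name.split(delim2)[:split2_pos])
    # delim1 cuts on the left: drop everything through its split1_pos-th occurrence
    return delim1.join(right_trimmed.split(delim1)[split1_pos:])

collection_name('/path/to/MySequences.json', delim1='.')
# 'MySequences'
collection_name('/path/to/my_sequences_2016-01-01.json', delim1='_', split1_pos=2)
# 'my_sequences'
collection_name('/path/to/plate-2_SampleName-01_redo.json',
                delim1='_', delim2='-', split2_pos=2)
# 'SampleName'
```

Each call above reproduces one of the worked examples in this section, which can be a handy way to check a delimiter scheme before running an import.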

abstar.utils.pandaseq

abstar.utils.pandaseq.run(input, output, algorithm='simple_bayesian', nextseq=False)

Merge paired-end FASTQ files with PANDAseq.

Examples

To merge a directory of raw (gzip compressed) files from a MiSeq run:

from abstar.utils.pandaseq import run

merged_files = run('/path/to/input', '/path/to/output')

Same as above, but using the Pear read merging algorithm:

merged_files = run('/path/to/input', '/path/to/output', algorithm='pear')

To merge a list of file pairs:

file_pairs = [('sample1_R1.fastq', 'sample1_R2.fastq'),
              ('sample2_R1.fastq.gz', 'sample2_R2.fastq.gz'),
              ('sample3_R1.fastq', 'sample3_R2.fastq')]
merged_files = run(file_pairs, '/path/to/output')
Parameters:
  • input (str, list) –

    Input can be one of three things:

    1. path to a directory of paired FASTQ files
    2. a list of paired FASTQ files
    3. a list of read pairs, with each read pair being a list/tuple containing paths to two paired read files

    Regardless of what input type is provided, paired FASTQ files can be either gzip compressed or uncompressed.

    When providing a list of files or a directory of files, it is assumed that all files follow Illumina naming conventions. If your file names aren’t Illumina-like, submit your files as a list of read pairs to ensure that the proper pairs of files are merged.

  • output (str) – Path to an output directory, into which merged FASTQ files will be deposited. To determine the filename for the merged file, the R1 file (or the first file in the read pair) is split at the first occurrence of the ‘_’ character. Therefore, the read pair ['my-sequences_R1.fastq', 'my-sequences_R2.fastq'] would be merged into my-sequences.fasta.
  • algorithm (str) – PANDAseq algorithm to be used for merging reads. Choices are: ‘simple_bayesian’, ‘ea_util’, ‘flash’, ‘pear’, ‘rdp_mle’, ‘stitch’, or ‘uparse’. Default is ‘simple_bayesian’, which is the default PANDAseq algorithm.
  • nextseq (bool) – Set to True if the sequencing data was generated on a NextSeq. Needed because the naming conventions for NextSeq output files differ from MiSeq output.
Returns:

a list of merged file paths

Return type:

list
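The merged-file naming rule described for the output parameter can be sketched in a few lines. This is an illustrative reimplementation of the behavior described above, not the pandaseq module's actual code; the function name merged_filename is hypothetical:

```python
import os

def merged_filename(r1_path):
    # Hypothetical sketch of the naming rule described above: the R1
    # filename is split at the first '_' and '.fasta' is appended.
    base = os.path.basename(r1_path).split('_')[0]
    return base + '.fasta'

merged_filename('/path/to/my-sequences_R1.fastq')
# 'my-sequences.fasta'
```

This also shows why Illumina-like names matter: if the sample name itself contains a ‘_’, everything after the first underscore is dropped from the merged filename.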