Specifying data for analysis

We introduce the concept of a “data store”. This represents the data record(s) that you want to analyse. It can be a single file, a directory of files, a zipped directory of files or a single tinydb file containing multiple data records.

We represent this concept with a DataStore class, which comes in different flavours:

  • directory based

  • zip archive based

  • TinyDB based (a NoSQL JSON-based database)

These can be read-only or writable. All of these types support being indexed, iterated over, filtered, and so on. The TinyDB variants have some unique abilities (discussed below).

A read-only data store

To create one of these, you provide a path AND a suffix matching the files within the directory / zip that you will be analysing. (If the path ends with .tinydb, no file suffix is required.)

from cogent3.app.io import get_data_store

dstore = get_data_store("data/raw.zip", suffix="fa*", limit=5)
dstore
5x member ReadOnlyZippedDataStore(source='/Users/gavin/repos/Cogent3/doc/data/raw.zip', members=['/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa', '/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000131791.fa', '/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000127054.fa'...)
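Conceptually, suffix-based member selection over a zip archive amounts to glob matching on file extensions. The following is a minimal stdlib sketch of that idea (the archive and file names are hypothetical, and this is not cogent3's actual implementation):

```python
import fnmatch
import io
import zipfile

# Build a small in-memory zip standing in for a raw data archive
# (hypothetical member names).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("raw/ENSG00000157184.fa", ">Human\nATGGTG\n")
    zf.writestr("raw/ENSG00000131791.fa", ">Human\nATGCCC\n")
    zf.writestr("raw/README.txt", "not a sequence file\n")

# Select members whose extension matches the glob pattern "fa*",
# analogous to get_data_store(..., suffix="fa*").
with zipfile.ZipFile(buf) as zf:
    members = [
        name
        for name in zf.namelist()
        if fnmatch.fnmatch(name.rsplit(".", 1)[-1], "fa*")
    ]

print(members)  # only the .fa files
```

Because the pattern is applied to the extension, `suffix="fa*"` would pick up both `.fa` and `.fasta` files while skipping the README.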

Data store “members”

These are able to read their own raw data.

m = dstore[0]
m
'/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa'
m.read()[:20]  # truncating
'>Human\nATGGTGCCCCGCC'
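Since read() returns plain text, you can post-process it however you like. As a hedged illustration, here is a stdlib-only split of a FASTA record of the same shape as the truncated output above into its label and sequence (the sequence string is illustrative, not the full record):

```python
# A member's read() returns raw text; split one FASTA record into
# its label (after ">") and its sequence (newlines removed).
raw = ">Human\nATGGTGCCCCGCC\n"
label, seq = raw.lstrip(">").split("\n", 1)
seq = seq.replace("\n", "")
print(label, seq)  # Human ATGGTGCCCCGCC
```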

Showing the last few members

Use the tail() method to see the last few members (head() shows the first few).

dstore.tail()
['/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa',
 '/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000131791.fa',
 '/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000127054.fa',
 '/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000067704.fa',
 '/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000182004.fa']

Filtering a data store for specific members

dstore.filtered("*ENSG00000067704*")
['/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000067704.fa']
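The pattern passed to filtered() behaves like a shell glob over member identifiers. A minimal stdlib sketch of that matching, using hypothetical member names (not cogent3's actual implementation):

```python
import fnmatch

# Hypothetical member identifiers standing in for a data store's contents.
members = [
    "data/raw/ENSG00000157184.fa",
    "data/raw/ENSG00000067704.fa",
]

# Glob-match identifiers against the pattern, as filtered() does.
hits = [m for m in members if fnmatch.fnmatch(m, "*ENSG00000067704*")]
print(hits)  # ['data/raw/ENSG00000067704.fa']
```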

Looping over a data store

for m in dstore:
    print(m)
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000157184.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000131791.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000127054.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000067704.fa
/Users/gavin/repos/Cogent3/doc/data/raw/ENSG00000182004.fa

Making a writable data store

The creation of a writable data store is handled for you by the different writers we provide under cogent3.app.io.

Warning

The WritableZippedDataStore is deprecated.

TinyDB data stores are special

When you specify a TinyDB data store as your output (by using io.write_db()), you get additional features that are useful for dissecting the results of an analysis.

One important issue to note is that the process which creates a TinyDB “locks” the file. If that process exits abnormally (e.g. the run that was producing it was interrupted), the file may remain in a locked state. If the db is in this state, cogent3 will not modify it unless you explicitly unlock it.

This is represented in the display as shown below.

dstore = get_data_store("data/demo-locked.tinydb")
dstore.describe
Locked db store.
record type    number
---------------------
completed      175
incomplete     0
logs           1

3 rows x 2 columns

To unlock, you execute the following:

dstore.unlock(force=True)

Interrogating run logs

If you use the apply_to(logger=True) method, a scitrack logfile will be included in the data store. This includes useful information regarding the run conditions that produced the contents of the data store.

dstore.summary_logs
summary of log files

time: 2019-07-24 14:42:56
name: load_unaligned-progressive_align-write_db-pid8650.log
python version: 3.7.3
who: gavin
command: /Users/gavin/miniconda3/envs/c3dev/lib/python3.7/site-packages/ipykernel_launcher.py -f /Users/gavin/Library/Jupyter/runtime/kernel-5eb93aeb-f6e0-493e-85d1-d62895201ae2.json
composable: load_unaligned(type='sequences', moltype='dna', format='fasta') + progressive_align(type='sequences', model='HKY85', gc=None, param_vals={'kappa': 3}, guide_tree=None, unique_guides=False, indel_length=0.1, indel_rate=1e-10) + write_db(type='output', data_path='../data/aligned-nt.tinydb', name_callback=None, create=True, if_exists='overwrite', suffix='json')

1 row x 6 columns

Log files can be accessed via a special attribute.

dstore.logs
['load_unaligned-progressive_align-write_db-pid8650.log']

Each element in that list is a DataStoreMember, whose read() method returns the log contents.

print(dstore.logs[0].read()[:225])  # truncated for clarity
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	system_details : system=Darwin Kernel Version 18.6.0: Thu Apr 25 23:16:27 PDT 2019; root:xnu-4903.261.4~2/RELEASE_X86_64
2019-07-24 14:42:56	Eratosthenes.local:8650	INFO	python