Darr API Documentation

Darr is a Python library for storing numeric data arrays in a format that is as open and simple as possible. It also provides easy memory-mapped access to such disk-based data using numpy indexing.

Two types of numeric data structures are supported: arrays and ragged arrays.

Darr objects can be created from array-like objects, such as numpy arrays and lists, using the asarray function. Alternatively, darr arrays can be created from scratch by the create_array function. Existing Darr data on disk can be accessed through the Array constructor. To remove a Darr array from disk, use delete_array.

Arrays

Accessing arrays

class darr.Array(path, accessmode='r')

Instantiate a Darr array from disk.

A darr array corresponds to a directory containing 1) a binary file with the raw numeric array values, 2) a text file (json format) describing the numeric type, array shape, and other format information, 3) a README text file documenting the data format, including examples of how to read the data in Python or Matlab.
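
To make the directory layout concrete, the following sketch writes and reads such a directory using only the Python standard library. It is an illustration under stated assumptions, not Darr's own code: the file name arrayvalues.bin is taken from the readcode example in this document, but the description file name and its keys ("shape", "structcode") are hypothetical.

```python
import json
import struct
import tempfile
from pathlib import Path

VALUES_FILE = "arrayvalues.bin"
# Hypothetical file name and keys, chosen for illustration only.
DESCRIPTION_FILE = "description.json"


def write_example(dirpath):
    """Write a 2x3 little-endian int32 array plus a JSON description."""
    dirpath = Path(dirpath)
    values = [1, 2, 3, 4, 5, 6]
    (dirpath / VALUES_FILE).write_bytes(struct.pack("<6i", *values))
    description = {"shape": [2, 3], "structcode": "<i"}
    (dirpath / DESCRIPTION_FILE).write_text(json.dumps(description))


def read_rows(dirpath):
    """Read the raw values back as a list of rows (row-major order)."""
    dirpath = Path(dirpath)
    description = json.loads((dirpath / DESCRIPTION_FILE).read_text())
    nrows, ncols = description["shape"]
    code = description["structcode"]  # "<i" means little-endian int32
    fmt = f"{code[0]}{nrows * ncols}{code[1]}"
    flat = struct.unpack(fmt, (dirpath / VALUES_FILE).read_bytes())
    return [list(flat[r * ncols:(r + 1) * ncols]) for r in range(nrows)]


with tempfile.TemporaryDirectory() as tmp:
    write_example(tmp)
    rows = read_rows(tmp)
```

Because the values are plain binary and the description is plain text, any language that can read bytes and JSON can recover the array.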

Parameters
  • path (str or pathlib.Path) – Path to disk-based array directory.

  • accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data. r means read-only, r+ means read-write. w does not exist. To create new darr arrays, potentially overwriting another one, use the asarray or create_array functions.

property accessmode

Data access mode of metadata, {‘r’, ‘r+’}.

append(array)

Append array-like objects to the end of the darr.

Data will be appended along the first axis. The shape of the data and that of the darr must be compatible. When appending data repeatedly, it is more efficient to use iterappend.

Parameters

array (array-like object) – This can be a numpy array or a sequence that can be converted into a numpy array.

Returns

Return type

None

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(4,2), overwrite=True)
>>> d.append([[1,2],[3,4],[5,6]])
>>> print(d)
[[ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 0.  0.]
 [ 1.  2.]
 [ 3.  4.]
 [ 5.  6.]]

archive(filepath=None, compressiontype='xz', overwrite=False)

Archive array data into a single compressed file.

Parameters
  • filepath (str) – Name of the archive. If None, it will be derived from the data’s path name.

  • compressiontype (str) – One of ‘xz’, ‘gz’, or ‘bz2’, corresponding to the lzma, gzip, and bz2 compression algorithms supported by the Python standard library.

  • overwrite ((True, False), optional) – Overwrites existing archive if it exists. Default is False.

Returns

The path of the created archive

Return type

pathlib.Path

Notes

See the tarfile library for more info on archiving formats
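
As a rough illustration of what such an archive amounts to, the sketch below packs a directory into a single compressed tar file with the standard tarfile module. The function archive_directory and the way it derives the archive name are assumptions made for this example; they are not Darr's implementation.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_directory(dirpath, filepath=None, compressiontype="xz",
                      overwrite=False):
    """Pack a directory into one compressed tar file and return its path."""
    dirpath = Path(dirpath)
    if compressiontype not in ("xz", "gz", "bz2"):
        raise ValueError(f"unsupported compression type: {compressiontype!r}")
    if filepath is None:
        # Deriving the name from the directory name is an assumption.
        filepath = dirpath.with_name(f"{dirpath.name}.tar.{compressiontype}")
    filepath = Path(filepath)
    if filepath.exists() and not overwrite:
        raise OSError(f"'{filepath}' exists; use overwrite=True to replace it")
    # tarfile modes "w:xz", "w:gz", "w:bz2" select lzma, gzip, and bz2.
    with tarfile.open(filepath, f"w:{compressiontype}") as tf:
        tf.add(dirpath, arcname=dirpath.name)
    return filepath


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "test.da"
    data.mkdir()
    (data / "arrayvalues.bin").write_bytes(b"\x00" * 16)
    archive = archive_directory(data)
    archive_is_tar = tarfile.is_tarfile(archive)
    with tarfile.open(archive) as tf:
        members = sorted(tf.getnames())
```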

copy(path, dtype=None, chunklen=None, accessmode='r', overwrite=False)

Copy darr to a different path, potentially changing its dtype.

The copying is performed in chunks to avoid running out of memory when copying very large darr arrays.

Parameters
  • path (str or pathlib.Path) – File system path to which the copy will be saved.

  • dtype (<dtype, None>) – Numpy data type of the copy. Default is None, which corresponds to the dtype of the darr to be copied.

  • chunklen (<int, None>) – The length of chunks (along first axis) that are written during creation. If None, it is chosen so that chunks are 10 Mb in total size.

  • accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data of the returned Darr object. r means read-only, r+ means read-write.

  • overwrite ((True, False), optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns

copy of the darr array

Return type

Array
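
The chunked copying described above can be pictured with a small sketch. It assumes that "10 Mb" means 10 * 1024 * 1024 bytes and that the chunk length is simply the number of first-axis steps that fit into that budget; both are assumptions, and the helper names are hypothetical.

```python
from math import prod


def default_chunklen(shape, itemsize, target_bytes=10 * 1024 * 1024):
    """Steps along the first axis so that one chunk fits in target_bytes."""
    # Bytes taken by one step along the first axis (at least 1).
    rowbytes = max(1, itemsize * prod(shape[1:]))
    return max(1, target_bytes // rowbytes)


def chunk_bounds(length, chunklen):
    """Yield (start, end) pairs that cover range(length) in order."""
    for start in range(0, length, chunklen):
        yield start, min(start + chunklen, length)


# A float64 array (8 bytes per item) of shape (1000000, 3).
chunklen = default_chunklen(shape=(1_000_000, 3), itemsize=8)
bounds = list(chunk_bounds(1_000_000, chunklen))
```

Only one chunk needs to be in memory at a time, so peak memory use stays near the chunk size regardless of the array size.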

property datadir

Data directory object with many useful methods, such as writing information to text or json files, archiving all data, calculating checksums etc.

property dtype

Numpy data type of the array values.

property itemsize

The size in bytes of each item in the array.

iterappend(arrayiterable)

Iteratively append data from a data iterable.

The iterable has to yield chunks of data that are array-like objects compliant with Darr arrays.

Parameters

arrayiterable (an iterable that yields array-like objects) –

Returns

Return type

None

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(3,2), overwrite=True)
>>> def ga():
...     yield [[1,2],[3,4]]
...     yield [[5,6],[7,8],[9,10]]
...
>>> d.iterappend(ga())
>>> print(d)
[[  0.   0.]
 [  0.   0.]
 [  0.   0.]
 [  1.   2.]
 [  3.   4.]
 [  5.   6.]
 [  7.   8.]
 [  9.  10.]]
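
One plausible reason why iterappend is more efficient than repeated append calls is that the data file only has to be opened once. The sketch below illustrates that idea on a raw binary file of float64 values; iterappend_rows is a hypothetical helper, not part of Darr.

```python
import os
import struct
import tempfile


def iterappend_rows(path, chunks, ncols=2):
    """Append every chunk from an iterable while opening the file once."""
    nrows = 0
    with open(path, "ab") as fh:
        for chunk in chunks:
            for row in chunk:
                # One row of ncols little-endian float64 values.
                fh.write(struct.pack(f"<{ncols}d", *row))
                nrows += 1
    return nrows


def ga():
    yield [[1, 2], [3, 4]]
    yield [[5, 6], [7, 8], [9, 10]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    appended = iterappend_rows(path, ga())
    filesize = os.path.getsize(path)
```
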
iterchunks(chunklen, stepsize=None, startindex=None, endindex=None, include_remainder=True, accessmode=None)

Iterate over data array of the darr yielding chunks of a given length and with a given stepsize.

This method keeps the underlying data file open during iteration, and is therefore relatively fast.

Parameters
  • chunklen (int) – Size of each chunk along the first axis. Note that the last chunk may be smaller than chunklen, depending on the size of the first axis.

  • stepsize (<int, None>) – Size of the shift per iteration along the first axis. Default is None, which means that stepsize equals chunklen.

  • startindex (<int, None>) – Start index value. Default is None, which means to start at the beginning.

  • endindex (<int, None>) – End index value. Default is None, which means to end at the end.

  • include_remainder (<True, False>) – Determines whether the remainder at the end of the array, if it exists, should be yielded. The remainder is smaller than chunklen. Default is True.

  • accessmode ({'r', 'r+'}, default 'r') – File access mode of the darr data. r means read-only, r+ means read-write.

Returns

a generator that produces numpy array chunks.

Return type

generator

Examples

>>> import darr as da
>>> fillfunc = lambda i: i # fill with index number
>>> d1 = da.create_array('test1.da', shape=(12,), fillfunc=fillfunc)
>>> print(d1)
[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.]
>>> d2 = da.asarray('test2.da', d1.iterchunks(chunklen=2, stepsize=3))
>>> print(d2)
[  0.   1.   3.   4.   6.   7.   9.  10.]

iterindices(chunklen, stepsize=None, startindex=None, endindex=None, include_remainder=True)

Generate indices of chunks of a given length and with a given stepsize.

This method keeps the underlying data file open during iteration, and is therefore relatively fast.

Parameters
  • chunklen (int) – Size of each chunk along the first axis. Note that the last chunk may be smaller than chunklen, depending on the size of the first axis.

  • stepsize (<int, None>) – Size of the shift per iteration along the first axis. Default is None, which means that stepsize equals chunklen.

  • startindex (<int, None>) – Start index value. Default is None, which means to start at the beginning.

  • endindex (<int, None>) – End index value. Default is None, which means to end at the end.

  • include_remainder (<True, False>) – Determines whether the remainder at the end of the array, if it exists, should be yielded. The remainder is smaller than chunklen. Default is True.

Returns

a generator that yields (start, end) index pairs of the chunks.

Return type

generator

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(12,), accessmode='r+', overwrite=True)
>>> for start, end in d.iterindices(chunklen=2, stepsize=3):
...     d[start:end] = 1
...
>>> print(d)
[ 1.  1.  0.  1.  1.  0.  1.  1.  0.  1.  1.  0.]
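
The interplay of chunklen, stepsize, and include_remainder can be sketched in plain Python. The generator below is a hypothetical stand-in that works on a length rather than on a Darr array, and its handling of the remainder (yield it once, then stop) is an assumption about the intended behaviour.

```python
def chunk_indices(length, chunklen, stepsize=None, startindex=None,
                  endindex=None, include_remainder=True):
    """Yield (start, end) index pairs along the first axis."""
    if chunklen < 1:
        raise ValueError("chunklen must be at least 1")
    step = chunklen if stepsize is None else stepsize
    if step < 1:
        raise ValueError("stepsize must be at least 1")
    start = 0 if startindex is None else startindex
    stop = length if endindex is None else min(endindex, length)
    while start < stop:
        end = start + chunklen
        if end > stop:
            # Fewer than chunklen items are left: this is the remainder.
            if include_remainder:
                yield start, stop
            break
        yield start, end
        start += step


# Same pattern as the example above, applied to a plain list.
values = [0.0] * 12
for start, end in chunk_indices(len(values), chunklen=2, stepsize=3):
    values[start:end] = [1.0] * (end - start)
```

With stepsize larger than chunklen, items between chunks are skipped, which is why every third value stays 0.
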
property mb

Array size in megabytes, excluding metadata.

property metadata

Dictionary-like interface to metadata.

property nbytes

Array size in bytes, excluding metadata.

property ndim

Number of dimensions

open_array(accessmode=None)

Open the array for efficient multiple read or write operations.

Although read and write operations can be performed conveniently using indexing notation on the Darr object, this can be relatively slow when performing many access operations in succession. For each read, the disk file needs to be opened, the data copied into memory, and the file closed again. In such cases, it is much faster to first open the disk-based data.

Parameters

accessmode ({'r', 'r+'}, default 'r') – File access mode of the disk array data. r means read-only, r+ means read-write.

Yields

None

Examples

>>> import darr as da
>>> d = da.create_array('test.da', shape=(1000,3), overwrite=True)
>>> with d.open_array(accessmode='r+'):
...     s1 = d[:10,1:].sum()
...     s2 = d[20:25,:2].sum()
...     d[500:] = 3.33
...
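
The open-once pattern can be illustrated with a small context manager. DiskValues below is a hypothetical class written for this sketch, not Darr's implementation: it counts how often the file is opened to show the difference between the two access styles.

```python
import contextlib
import os
import struct
import tempfile


class DiskValues:
    """Hypothetical stand-in: float64 values in a raw binary file."""

    def __init__(self, path):
        self.path = path
        self._fh = None
        self.nopens = 0  # counts how often the file is opened

    @contextlib.contextmanager
    def opened(self):
        """Keep one file handle open for all reads inside the with-block."""
        self.nopens += 1
        with open(self.path, "rb") as fh:
            self._fh = fh
            try:
                yield None
            finally:
                self._fh = None

    def __getitem__(self, index):
        if self._fh is not None:
            return self._read(self._fh, index)
        # Slow path: open and close the file for this single access.
        self.nopens += 1
        with open(self.path, "rb") as fh:
            return self._read(fh, index)

    @staticmethod
    def _read(fh, index):
        fh.seek(index * 8)  # 8 bytes per float64
        return struct.unpack("<d", fh.read(8))[0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "values.bin")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<4d", 0.5, 1.5, 2.5, 3.5))
    d = DiskValues(path)
    slow = [d[i] for i in range(4)]      # four separate opens
    with d.opened():
        fast = [d[i] for i in range(4)]  # one open for all four reads
    nopens = d.nopens
```
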
property path

File system path to array data

readcode(language)

Generate code to read the array in a different language.

Note that this does not include reading the metadata, which is just based on a text file in JSON format.

Parameters

language (str) – One of the languages that are supported. Choose from: ‘darr’, ‘idl’, ‘julia_ver0’, ‘julia_ver1’, ‘mathematica’, ‘matlab’, ‘maple’, ‘numpy’, ‘numpymemmap’, ‘R’.

Example

>>> import darr
>>> a = darr.asarray('test.darr', [[1,2,3],[4,5,6]])
>>> print(a.readcode('matlab'))
fileid = fopen('arrayvalues.bin');
a = fread(fileid, [3, 2], '*int32', 'ieee-le');
fclose(fileid);

property shape

Tuple with sizes of each axis of the data array.

property size

Total number of values in the data array.

Creating arrays

darr.asarray(path, array, dtype=None, accessmode='r', metadata=None, chunklen=None, overwrite=False)

Save an array or array generator as a Darr array to file system path.

Data is always written in ‘C’ order to disk, independent of the order of array.
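
For readers unfamiliar with the term, ‘C’ order means that the last axis varies fastest in the flat data on disk. The sketch below shows this for a small nested list, with Fortran order for contrast; the helper names are made up for this illustration.

```python
def flatten_c_order(rows):
    """Flatten a 2-D nested list row by row (C order, last axis fastest)."""
    return [value for row in rows for value in row]


def flatten_f_order(rows):
    """Flatten column by column (Fortran order), shown for contrast."""
    return [row[c] for c in range(len(rows[0])) for row in rows]


def c_offset(index, shape):
    """Position of a multi-dimensional index in the C-ordered flat data."""
    offset = 0
    for i, n in zip(index, shape):
        offset = offset * n + i
    return offset


rows = [[1, 2, 3], [4, 5, 6]]
flat = flatten_c_order(rows)
last = flat[c_offset((1, 2), shape=(2, 3))]  # same value as rows[1][2]
```

This is also why the Matlab code in the readcode example reads a (2, 3) array with dimensions [3, 2]: Matlab fills matrices column by column.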

Parameters
  • path (str or pathlib.Path) – File system path to which the array will be saved. Note that this will be a directory containing multiple files.

  • array (array-like object or generator yielding array-like objects) – This can be a numpy array, a sequence that can be converted into a numpy array, or a generator that yields such objects. The latter will be concatenated along the first dimension.

  • dtype (numpy dtype, optional) – Is inferred from the data if None. If dtype is provided the data will be cast to dtype. Default is None.

  • accessmode ({r, r+}, optional) – File access mode of the darr that is returned. r means read-only, r+ means read-write. In the latter case, data can be changed. Default r.

  • metadata ({None, dict}) – Dictionary with metadata to be saved in a separate JSON file. Default is None. If so, and the array has a ‘metadata’ attribute, Darr will try to use it as metadata of the output array.

  • chunklen (<int, None>) – The length of chunks (along first axis) that are read and written during the process. If None and the array is a numpy array or darr, it is chosen so that chunks are 10 Mb in total size. If None and array is a generator or sequence, chunklen will be 1.

  • overwrite ((True, False), optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns

A Darr array instance.

Return type

Array

See also

create_array()

create an array from scratch.

Examples

>>> from darr import asarray
>>> asarray('data.da', [0,1,2,3])
darr([0, 1, 2, 3])
>>> asarray('data.da', [0,1,2,3], dtype='float64', overwrite=True)
darr([ 0.,  1.,  2.,  3.])
>>> ar = asarray('data_rw.da', [0,1,2,3,4,5], accessmode='r+')
>>> ar
darr([0, 1, 2, 3, 4, 5]) (r+)
>>> ar[-1] = 8
>>> ar
darr([0, 1, 2, 3, 4, 8]) (r+)
>>> ar[::2] = 9
>>> ar
darr([9, 1, 9, 3, 9, 8]) (r+)

darr.create_array(path, shape, dtype='float64', fill=None, fillfunc=None, accessmode='r+', chunklen=None, metadata=None, overwrite=False)

Create a new darr array of given shape and type, filled with predetermined values. Data is always written in ‘C’ order to disk.

Parameters
  • path (str or pathlib.Path) – File system path to which the array will be saved. Note that this will be a directory containing multiple files.

  • shape (int or sequence of ints) – Shape of the darr.

  • dtype (dtype, optional) – The type of the darr. Default is ‘float64’

  • fill (number, optional) – The value used to fill the array with. Default is None, which will lead to the array being filled with zeros.

  • fillfunc (function, optional) – A function that generates the fill values, potentially on the basis of the index numbers of the first axis of the array. This function should only have one argument, which will be automatically provided during filling and which represents the index numbers along the first axis for all dimensions (see example below). If fillfunc is provided, fill should be None, and vice versa. Default is None.

  • accessmode (<r, r+>, optional) – File access mode of the darr data. r means read-only, r+ means read-write, i.e. values can be changed. Default r+.

  • chunklen (<int, None>) – The length of chunks (along first axis) that are written during creation. If None, it is chosen so that chunks are 10 Mb in total size.

  • metadata ({None, dict}) – Dictionary with metadata to be saved in a separate JSON file. Default is None.

  • overwrite (<True, False>, optional) – Overwrites existing darr data if it exists. Note that a darr path is a directory. If that directory contains additional files, these will not be removed and an OSError is raised. Default is False.

Returns

A Darr array instance.

Return type

Array

See also

asarray()

create a darr array from existing array-like object or generator.

Examples

>>> import darr as da
>>> da.create_array('testarray0', shape=(5,2))
darr([[ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.],
   [ 0.,  0.]]) (r+)
>>> da.create_array('testarray1', shape=(5,2), dtype='int16')
darr([[0, 0],
   [0, 0],
   [0, 0],
   [0, 0],
   [0, 0]], dtype=int16) (r+)
>>> da.create_array('testarray3', shape=(5,2), fill=23.4)
darr([[ 23.4,  23.4],
   [ 23.4,  23.4],
   [ 23.4,  23.4],
   [ 23.4,  23.4],
   [ 23.4,  23.4]]) (r+)
>>> fillfunc = lambda i: i * 2
>>> da.create_array('testarray4', shape=(5,), fillfunc=fillfunc)
darr([ 0.,  2.,  4.,  6.,  8.]) (r+)
>>> fillfunc = lambda i: i * [1, 2]
>>> da.create_array('testarray5', shape=(5,2), fillfunc=fillfunc)
darr([[ 0.,  0.],
   [ 1.,  2.],
   [ 2.,  4.],
   [ 3.,  6.],
   [ 4.,  8.]]) (r+)

Deleting arrays

darr.delete_array(da)

Delete Darr array data from disk.

Parameters

da (Array or str or pathlib.Path) – The darr object to be deleted or file system path to it.

Truncating arrays

darr.truncate_array(a, index)

Truncate darr data.

Parameters
  • a (array or str or pathlib.Path) – The darr object to be truncated or file system path to it.

  • index (int) – The index along the first axis at which the darr should be truncated. Negative indices can be used but the resulting length of the truncated darr should be larger than 0 and smaller than the current length.

Examples

>>> import darr as da
>>> fillfunc = lambda i: i
>>> a = da.create_array('testarray.da', shape=(5,2), fillfunc=fillfunc)
>>> a
darr([[ 0.,  0.],
           [ 1.,  1.],
           [ 2.,  2.],
           [ 3.,  3.],
           [ 4.,  4.]]) (r+)
>>> da.truncate_array(a, 3)
>>> a
darr([[ 0.,  0.],
           [ 1.,  1.],
           [ 2.,  2.]]) (r+)
>>> da.truncate_array(a, -1)
>>> a
darr([[ 0.,  0.],
           [ 1.,  1.]]) (r+)
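
Because the values are stored as a flat C-ordered binary file, truncating along the first axis comes down to shortening that file. The sketch below shows the arithmetic under that assumption; truncate_values is a hypothetical helper, and a real Darr array would also need its description file updated, which is left out here.

```python
import os
import struct
import tempfile
from math import prod


def truncate_values(path, shape, itemsize, index):
    """Truncate a C-ordered raw binary file along the first axis.

    Returns the new shape. Negative indices count from the end.
    """
    length = shape[0]
    newlength = index + length if index < 0 else index
    if not 0 < newlength < length:
        raise IndexError(
            f"index {index} does not shorten an array of length {length}")
    rowbytes = itemsize * prod(shape[1:])  # bytes per first-axis step
    os.truncate(path, newlength * rowbytes)
    return (newlength,) + tuple(shape[1:])


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrayvalues.bin")
    with open(path, "wb") as fh:
        fh.write(struct.pack("<10d", *range(10)))  # shape (5, 2), float64
    shape = truncate_values(path, (5, 2), itemsize=8, index=3)
    shape = truncate_values(path, shape, itemsize=8, index=-1)
    filesize = os.path.getsize(path)
```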

Ragged Arrays

Warning

Note that Ragged Arrays are still experimental! They do not support setting values yet, only creating, appending, and reading.

Accessing ragged arrays

class darr.RaggedArray(path, accessmode='r')

Disk-based sequence of arrays that may have a variable length in at most one dimension.

property accessmode

Data access mode of metadata, {‘r’, ‘r+’}.

archive(filepath=None, compressiontype='xz', overwrite=False)

Archive array data into a single compressed file.

Parameters
  • filepath (str) – Name of the archive. If None, it will be derived from the data’s path name.

  • compressiontype (str) – One of ‘xz’, ‘gz’, or ‘bz2’, corresponding to the lzma, gzip, and bz2 compression algorithms supported by the Python standard library.

  • overwrite ((True, False), optional) – Overwrites existing archive if it exists. Default is False.

Returns

The path of the created archive

Return type

pathlib.Path

Notes

See the tarfile library for more info on archiving formats

property atom

Dimensions of the non-variable axes of the arrays.

property datadir

Data directory object with many useful methods, such as writing information to text or json files, archiving all data, calculating checksums etc.

property dtype

Numpy data type of the array values.

iterappend(arrayiterable)

Iteratively append data from a data iterable.

The iterable has to yield array-like objects compatible with the darr. The length of the first dimension of these objects may differ, but the lengths of the other dimensions, if any, have to be the same.

Parameters

arrayiterable (an iterable that yields array-like objects) –

Returns

Return type

None

property mb

Storage size in megabytes of the ragged array.

property metadata

Dictionary of metadata.

property narrays

Number of subarrays in the ragged array.

property path

File system path to array data

readcode(language)

Generate code to read the array in a different language.

Note that this does not include reading the metadata, which is just based on a text file in JSON format.

Parameters

language (str) – One of the languages that are supported. Choose from: ‘matlab’, ‘numpymemmap’, ‘R’.

Example

>>> import darr
>>> a = darr.asraggedarray('test.darr', [[1],[2,3],[4,5,6],[7,8,9,10]], overwrite=True)
>>> print(a.readcode('matlab'))
fileid = fopen('indices/arrayvalues.bin');
i = fread(fileid, [2, 4], '*int64', 'ieee-le');
fclose(fileid);
fileid = fopen('values/arrayvalues.bin');
v = fread(fileid, 10, '*int32', 'ieee-le');
fclose(fileid);
% example to read third subarray
startindex = i(1,3) + 1;  % matlab starts counting from 1
endindex = i(2,3);  % matlab has inclusive end index
a = v(startindex:endindex);
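
The Matlab code above suggests a layout in which all values are stored as one flat array and a second array holds a (start, end) pair per subarray. The sketch below reproduces that layout with the standard library and reads the third subarray back. The paths indices/arrayvalues.bin and values/arrayvalues.bin and the int64/int32 types come from the example; the functions themselves are hypothetical and not Darr's implementation.

```python
import struct
import tempfile
from pathlib import Path


def write_ragged(dirpath, subarrays):
    """Store subarrays as flat int32 values plus int64 (start, end) pairs."""
    dirpath = Path(dirpath)
    values, indices, start = [], [], 0
    for sub in subarrays:
        values.extend(sub)
        indices.extend((start, start + len(sub)))
        start += len(sub)
    for name, code, data in (("values", "i", values),
                             ("indices", "q", indices)):
        (dirpath / name).mkdir()
        packed = struct.pack(f"<{len(data)}{code}", *data)
        (dirpath / name / "arrayvalues.bin").write_bytes(packed)


def read_subarray(dirpath, k):
    """Read subarray k using its zero-based, end-exclusive index pair."""
    dirpath = Path(dirpath)
    with open(dirpath / "indices" / "arrayvalues.bin", "rb") as fh:
        fh.seek(k * 16)  # two int64 values per subarray
        start, end = struct.unpack("<2q", fh.read(16))
    with open(dirpath / "values" / "arrayvalues.bin", "rb") as fh:
        fh.seek(start * 4)  # 4 bytes per int32 value
        nbytes = (end - start) * 4
        return list(struct.unpack(f"<{end - start}i", fh.read(nbytes)))


with tempfile.TemporaryDirectory() as tmp:
    write_ragged(tmp, [[1], [2, 3], [4, 5, 6], [7, 8, 9, 10]])
    third = read_subarray(tmp, 2)
```

Only the index pair and the requested values are read, so a single subarray can be fetched without loading the whole data set.
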
property size

Total number of values in the data array.

Creating ragged arrays

darr.asraggedarray(path, arrayiterable, dtype=None, metadata=None, accessmode='r+', overwrite=False)
darr.create_raggedarray(path, atom=(), dtype='float64', metadata=None, accessmode='r+', overwrite=False)

Deleting ragged arrays

darr.delete_raggedarray(ra)

Delete Darr ragged array data from disk.

Parameters

ra (RaggedArray or str or pathlib.Path) – The darr ragged array object to be deleted or file system path to it.

Truncating ragged arrays

darr.truncate_raggedarray(ra, index)

Truncate darr ragged array.

Parameters
  • ra (array or str or pathlib.Path) – The darr object to be truncated or file system path to it.

  • index (int) – The index along the first axis at which the darr ragged array should be truncated. Negative indices can be used but the resulting length of the truncated darr should be 0 or larger and smaller than the current length.