gvar.dataset - Random Data Sets

Introduction

gvar.dataset contains several tools for collecting and analyzing random samples from arbitrary distributions. The random samples are represented by lists of numbers or arrays, where each number/array is a new sample from the underlying distribution. For example, six samples from a one-dimensional gaussian distribution, 1±1, might look like

>>> random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]

while six samples from a two-dimensional distribution, [1±1, 2±1], might be

>>> random_arrays = [[ 0.494, 2.734], [ 0.172, 1.400], [ 1.571, 1.304],
...                  [ 1.532, 1.510], [ 0.669, 0.873], [ 1.242, 2.188]]

Samples from more complicated multidimensional distributions are represented by dictionaries whose values are lists of numbers or arrays: for example,

>>> random_dict = dict(n=random_numbers, a=random_arrays)

where list elements random_dict['n'][i] and random_dict['a'][i] are part of the same multidimensional sample for every i — that is, the lists for different keys in the dictionary are synchronized with each other.

With large samples, we typically want to estimate the mean value of the underlying distribution. This is done using gvar.dataset.avg_data(): for example,

>>> print(avg_data(random_numbers))
1.31(45)

indicates that 1.31(45) is our best guess, based only upon the samples in random_numbers, for the mean of the distribution from which those samples were drawn. Similarly

>>> print(avg_data(random_arrays))
[0.95(22) 1.67(25)]

indicates that the means for the two-dimensional distribution behind random_arrays are [0.95(22), 1.67(25)]. avg_data() can also be applied to a dictionary whose values are lists of numbers/arrays: for example,

>>> print(avg_data(random_dict))
{'a': array([0.95(22), 1.67(25)], dtype=object),'n': 1.31(45)}

Class gvar.dataset.Dataset can be used to assemble dictionaries containing random samples. For example, imagine that the random samples above were originally written into a file, as they were generated:

# file: datafile
n 1.739
a [ 0.494, 2.734]
n 2.682
a [ 0.172, 1.400]
n 2.493
a [ 1.571, 1.304]
n -0.460
a [ 1.532, 1.510]
n 0.603
a [ 0.669, 0.873]
n 0.800
a [ 1.242, 2.188]

Here each line is a different random sample, either from the one-dimensional distribution (labeled n) or from the two-dimensional distribution (labeled a). Assuming the file is called datafile, this data can be read into a dictionary, essentially identical to random_dict above, using:

>>> data = Dataset("datafile")
>>> print(data['a'])
[array([ 0.494, 2.734]), array([ 0.172, 1.400]), array([ 1.571, 1.304]) ... ]
>>> print(avg_data(data['n']))
1.31(45)

The brackets and commas can be omitted in the input file for one-dimensional arrays: for example, datafile (above) could equivalently be written

# file: datafile
n 1.739
a 0.494 2.734
n 2.682
a 0.172 1.400
...

Other data formats may also be easy to use. For example, a data file written using yaml would look like

# file: datafile
---
n: 1.739
a: [ 0.494, 2.734]
---
n: 2.682
a: [ 0.172, 1.400]
.
.
.

and could be read into a gvar.dataset.Dataset using:

import yaml
from gvar.dataset import Dataset

data = Dataset()
with open("datafile", "r") as dfile:
    for d in yaml.safe_load_all(dfile):     # iterate over yaml records
        data.append(d)                      # d is a dictionary

Finally, note that data can be binned into bins of size binsize using gvar.dataset.bin_data(). For example, gvar.dataset.bin_data(data, binsize=3) replaces every three samples in data by the average of those samples. This creates a dataset that is 1/3 the size of the original but has the same mean. Binning is useful for making large datasets more manageable, and also for removing sample-to-sample correlations. Over-binning, however, erases statistical information.
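
For example (a minimal sketch using random_numbers from above):

binned = bin_data(random_numbers, binsize=3)
# binned contains two samples, each the average of three consecutive samples:
#   [(1.739 + 2.682 + 2.493)/3, (-0.460 + 0.603 + 0.800)/3]
print(avg_data(binned))     # same mean as avg_data(random_numbers), fewer samples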

Class gvar.dataset.Dataset can also be used to build a dataset sample by sample in code: for example,

>>> a = Dataset()
>>> a.append(n=1.739, a=[ 0.494, 2.734])
>>> a.append(n=2.682, a=[ 0.172, 1.400])
...

creates the same dataset as above.
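
Once all six samples from the introduction have been appended, averaging the assembled dataset reproduces the earlier results (a sketch):

print(avg_data(a))      # same averages as avg_data(random_dict) above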

Functions

The functions defined in the module are:

gvar.dataset.avg_data(data, spread=False, median=False, bstrap=False, noerror=False, warn=True)

Average random data to estimate mean.

data is a list of random numbers, a list of random arrays, or a dictionary of lists of random numbers and/or arrays: for example,

>>> random_numbers = [1.60, 0.99, 1.28, 1.30, 0.54, 2.15]
>>> random_arrays = [[12.2,121.3],[13.4,149.2],[11.7,135.3],
...                  [7.2,64.6],[15.2,69.0],[8.3,108.3]]
>>> random_dict = dict(n=random_numbers,a=random_arrays)

where in each case there are six random numbers/arrays. avg_data estimates the means of the distributions from which the random numbers/arrays are drawn, together with the uncertainties in those estimates. The results are returned as a gvar.GVar or an array of gvar.GVars, or a dictionary of gvar.GVars and/or arrays of gvar.GVars:

>>> print(avg_data(random_numbers))
1.31(20)
>>> print(avg_data(random_arrays))
[11.3(1.1) 108(13)]
>>> print(avg_data(random_dict))
{'a': array([11.3(1.1), 108(13)], dtype=object),'n': 1.31(20)}

The arrays in random_arrays are one dimensional; in general, they can have any shape.

avg_data(data) also estimates any correlations between different quantities in data. When data is a dictionary, it does this by assuming that the lists of random numbers/arrays for the different data[k]s are synchronized, with the first element in one list corresponding to the first elements in all other lists, and so on. If some lists are shorter than others, the longer lists are truncated to the same length as the shortest list (discarding data samples).
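
One way to inspect those correlation estimates is with gvar.evalcorr() from the parent gvar module (a sketch):

import gvar as gv

avg = avg_data(random_arrays)
print(gv.evalcorr(avg))     # 2x2 correlation matrix for the two array components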

There are four optional arguments. If argument spread=True, each standard deviation in the results refers to the spread in the data, not the uncertainty in the estimate of the mean. The former is sqrt(N) times larger, where N is the number of random numbers (or arrays) being averaged:

>>> print(avg_data(random_numbers,spread=True))
1.31(50)
>>> print(avg_data(random_numbers))
1.31(20)
>>> print((0.50 / 0.20) ** 2)   # should be (about) 6
6.25

This is useful, for example, when averaging bootstrap data. The default value is spread=False.

The second option is triggered by setting median=True. This replaces the means in the results by medians, while the standard deviations are approximated by the half-width of the interval, centered around the median, that contains 68% of the data. These estimates are more robust than the mean and standard deviation when averaging over small amounts of data; in particular, they are unaffected by extreme outliers in the data. The default is median=False.

The third option is triggered by setting bstrap=True. This is shorthand for setting median=True and spread=True, and overrides any explicit setting for these keyword arguments. This is the typical choice for analyzing bootstrap data — hence its name. The default value is bstrap=False.

The fourth option is to omit the error estimates on the averages, which is triggered by setting noerror=True. Just the mean values are returned. The default value is noerror=False.

The final option warn determines whether or not a warning is issued when different components of a dictionary data set have different sample sizes. The default value is warn=True.
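
The following sketch collects the remaining options (outputs omitted, since they depend on the estimators chosen):

avg_data(random_numbers, median=True)   # median, with 68% half-width as the error
avg_data(random_numbers, bstrap=True)   # same as median=True, spread=True
avg_data(random_numbers, noerror=True)  # mean values only, no error estimates
avg_data(random_dict, warn=False)       # suppress warning about unequal sample sizes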

gvar.dataset.autocorr(data)

Compute autocorrelation in random data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays.

When data is a list of random numbers, autocorr(data) returns an array where autocorr(data)[i] is the correlation between elements in data that are separated by distance i in the list: for example,

>>> print(autocorr([2,-2,2,-2,2,-2]))
[ 1. -1.  1. -1.  1. -1.]

shows perfect correlation between elements separated by an even interval in the list, and perfect anticorrelation between elements separated by an odd interval.

autocorr(data) returns a list of arrays of autocorrelation coefficients when data is a list of random arrays. Again autocorr(data)[i] gives the autocorrelations for data elements separated by distance i in the list. Similarly autocorr(data) returns a dictionary when data is a dictionary.

autocorr(data) uses FFTs to compute the autocorrelations; the cost of computing the autocorrelations should grow roughly linearly with the number of random samples in data (up to logarithms).
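
A common use is to check whether successive samples are correlated before choosing a bin size for gvar.dataset.bin_data() (a sketch; data here stands for any long list of random numbers):

rho = autocorr(data)
print(rho[:4])      # rho[0] is 1 by construction; later coefficients should be
                    # close to zero if successive samples are effectively independent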

gvar.dataset.bin_data(data, binsize=2)

Bin random data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bin_data(data,binsize) replaces consecutive groups of binsize numbers/arrays by the average of those numbers/arrays. The result is a new data list (or dictionary) with 1/binsize times as much random data: for example,

>>> print(bin_data([1,2,3,4,5,6,7],binsize=2))
[1.5, 3.5, 5.5]
>>> print(bin_data(dict(s=[1,2,3,4,5],v=[[1,2],[3,4],[5,6],[7,8]]),binsize=2))
{'s': [1.5, 3.5], 'v': [array([ 2.,  3.]), array([ 6.,  7.])]}

Data is dropped at the end if there is insufficient data to form complete bins. Binning is used to make calculations faster and to reduce measurement-to-measurement correlations, if they exist. Over-binning erases useful information.

gvar.dataset.bootstrap_iter(data, n=None)

Create iterator that returns bootstrap copies of data.

data is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bootstrap_iter(data,n) is an iterator that returns n bootstrap copies of data. The random numbers/arrays in a bootstrap copy are drawn at random (with repetition allowed) from among the samples in data: for example,

>>> data = [1.1, 2.3, 0.5, 1.9]
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
[ 1.1  1.1  0.5  1.9]
>>> print(next(data_iter))
[ 0.5  2.3  1.9  0.5]

>>> data = dict(a=[1,2,3,4],b=[1,2,3,4])
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
{'a': array([3, 3, 1, 2]), 'b': array([3, 3, 1, 2])}
>>> print(next(data_iter))
{'a': array([1, 3, 3, 2]), 'b': array([1, 3, 3, 2])}

>>> data = [[1,2],[3,4],[5,6],[7,8]]
>>> data_iter = bootstrap_iter(data)
>>> print(next(data_iter))
[[ 7.  8.]
 [ 1.  2.]
 [ 1.  2.]
 [ 7.  8.]]
>>> print(next(data_iter))
[[ 3.  4.]
 [ 7.  8.]
 [ 3.  4.]
 [ 1.  2.]]

The distribution of bootstrap copies is an approximation to the distribution from which data was drawn. Consequently means, variances and correlations for bootstrap copies should be similar to those in data. Analyzing variations from bootstrap copy to copy is often useful when dealing with non-gaussian behavior or complicated correlations between different quantities.

Parameter n specifies the maximum number of copies; there is no maximum if n is None.
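
A typical application is to propagate non-gaussian statistics through an analysis. The following sketch estimates the uncertainty in the median of data (the use of numpy.median here is illustrative, not part of bootstrap_iter):

import numpy as np

data = [1.1, 2.3, 0.5, 1.9]
medians = [np.median(bscopy) for bscopy in bootstrap_iter(data, n=100)]
print(avg_data(medians, bstrap=True))   # median of the bootstrap medians, +- spread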

Classes

gvar.dataset.Dataset is used to assemble random samples from multidimensional distributions:

class gvar.dataset.Dataset

Dictionary for collecting random data.

A gvar.dataset.Dataset is an ordered dictionary whose values represent collections of random samples. Each value is a numpy array whose first index labels the random sample. Random samples can be numbers or arrays of numbers. The keys identify the quantity being sampled.

A Dataset can be assembled piece by piece, as random data is accumulated, or it can be read from a file. Consider a situation where there are four random samples of a scalar s and four random samples of a vector v. These can be collected as follows:

>>> dset = Dataset()
>>> dset.append(s=1.1, v=[12.2, 20.6])
>>> dset.append(s=0.8, v=[14.1, 19.2])
>>> dset.append(s=0.95, v=[10.3, 19.7])
>>> dset.append(s=0.91, v=[8.2, 21.0])
>>> print(dset['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]
>>> print(dset['v'])       # 4 random vector-values of v
[array([ 12.2,  20.6]), array([ 14.1,  19.2]), array([ 10.3,  19.7]), array([  8.2,  21. ])]

The argument to dset.append() can also be a dictionary: for example, dd = dict(s=1.1,v=[12.2,20.6]); dset.append(dd) is equivalent to the first append statement above. One can also append data key-by-key: for example, dset.append('s',1.1); dset.append('v',[12.2,20.6]) is equivalent to the first append in the example above.

Use extend in place of append to add data in batches: for example,

>>> dset = Dataset()
>>> dset.extend(s=[1.1, 0.8], v=[[12.2, 20.6], [14.1, 19.2]])
>>> dset.extend(s=[0.95, 0.91], v=[[10.3, 19.7],[8.2, 21.0]])
>>> print(dset['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]

gives the same dataset as the first example above.

The same Dataset can also be created from a text file named 'datafile' with the following contents:

# file: datafile
s 1.1
v [12.2, 20.6]
s 0.8
v [14.1, 19.2]
s 0.95
v [10.3, 19.7]
s 0.91
v [8.2, 21.0]

Here each line consists of a key followed by a new random sample for that key. Lines that begin with # are ignored. The file is read using:

>>> dset = Dataset('datafile')
>>> print(dset['s'])
[ 1.1, 0.8, 0.95, 0.91]

Data can be binned while reading it in, which might be useful if the data set is huge or if correlations are a concern. To bin the data contained in file datafile into bins of size 2 we use:

>>> dset = Dataset('datafile', binsize=2)
>>> print(dset['s'])
[0.95, 0.93]

The keys read from a data file are restricted to those listed in keyword keys and those that are matched (or partially matched) by regular expression grep, if one or the other of these is specified: for example,

>>> dset = Dataset('datafile')
>>> print([k for k in dset])
['s', 'v']
>>> dset = Dataset('datafile', keys=['v'])
>>> print([k for k in dset])
['v']
>>> dset = Dataset('datafile', grep='[^v]')
>>> print([k for k in dset])
['s']
>>> dset = Dataset('datafile', keys=['v'], grep='[^v]')
>>> print([k for k in dset])
[]

In addition to text files, hdf5 files can also be read (provided module h5py is available): for example,

>>> dset = Dataset('datafile.h5', h5group='/mcdata')

reads the hdf5 datasets in hdf5 group '/mcdata'. An hdf5 equivalent to the text file above would contain two datasets, one named 's' that is a one-dimensional array with shape (4,), and another named 'v' that is a two-dimensional array with shape (4, 2):

>>> import h5py
>>> for v in h5py.File('datafile.h5', 'r')['/mcdata'].values():
...     print(v)
<HDF5 dataset "s": shape (4,), type "<f8">
<HDF5 dataset "v": shape (4, 2), type "<f8">

Finally, Datasets can also be constructed from other dictionaries (including other Datasets), or lists of key-data tuples. For example,

>>> dset = Dataset('datafile')
>>> dset_binned = Dataset(dset, binsize=2)
>>> dset_v = Dataset(dset, keys=['v'])

reads data from file 'datafile' into Dataset dset, and then creates a new Dataset with the data binned (dset_binned), and another that only contains the data with key 'v' (dset_v).

Parameters:
  • inputdata (str or list or dictionary) –

    If inputdata is a string, it is the name of a file containing datasets. Two formats are supported. If the filename ends in ‘.h5’, the file is in hdf5 format, with datasets that are numpy arrays whose first index labels the random sample.

    The other file format is a text file where each line consists of a key followed by a number or array of numbers representing a new random sample associated with that key. Lines beginning with # are comments. A list of text file names can also be supplied, and text files can be compressed (with names ending in .gz or .bz2).

    If inputdata is a dictionary or a list of (key,value) tuples, its keys and values are copied into the dataset. Its values should be arrays whose first index labels the random sample.

  • binsize (int) – Bin the random samples in bins of size binsize. Default value is binsize=1 (i.e., no binning).
  • grep (str or None) – If not None, only keys that match or partially match regular expression grep are retained in the data set. Keys that don’t match are ignored. Default is grep=None.
  • keys (list) – List of keys to retain in data set. Keys that are not in the list are ignored. Default is keys=None which implies that all keys are kept.
  • h5group (str or list) – Address within the hdf5 file identified by inputdata that contains the relevant datasets. Every hdf5 dataset in group h5group is read into the dataset, with the same key as in h5group. Default is the top group in the file: h5group='/'. h5group can also be a list of groups, in which case datasets from all of the groups are read.

The main attributes and methods are:

samplesize

Smallest number of samples for any key.
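
For the 'datafile' example above, for instance:

>>> dset = Dataset('datafile')
>>> print(dset.samplesize)
4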

append(*args, **kargs)

Append data to dataset.

There are three equivalent ways of adding data to a dataset data: for example, each of

data.append(n=1.739,a=[0.494,2.734])        # method 1

data.append('n', 1.739)                     # method 2
data.append('a', [0.494,2.734])

dd = dict(n=1.739,a=[0.494,2.734])          # method 3
data.append(dd)

adds one new random number to data['n'], and a new vector to data['a'].

extend(*args, **kargs)

Add batched data to dataset.

There are three equivalent ways of adding batched data, containing multiple samples for each quantity, to a dataset data: for example, each of

data.extend(n=[1.739,2.682],
            a=[[0.494,2.734],[ 0.172, 1.400]])  # method 1

data.extend('n', [1.739,2.682])                 # method 2
data.extend('a', [[0.494,2.734],[ 0.172, 1.400]])

dd = dict(n=[1.739,2.682],
            a=[[0.494,2.734],[ 0.172, 1.400]])  # method 3
data.extend(dd)

adds two new random numbers to data['n'], and two new random vectors to data['a'].

This method can be used to merge two datasets, whether or not they share keys: for example,

data = Dataset("file1")
data_extra = Dataset("file2")
data.extend(data_extra)   # data now contains all of data_extra

grep(rexp)

Create new dataset containing items whose keys match rexp.

Returns a new gvar.dataset.Dataset containing only the items self[k] whose keys k match regular expression rexp (a string) according to Python module re:

>>> a = Dataset()
>>> a.append(xx=1.,xy=[10.,100.])
>>> a.append(xx=2.,xy=[20.,200.])
>>> print(a.grep('y'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x|y'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('[^y][^x]'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}

Items are retained even if rexp matches only part of the item’s key.

slice(sl)

Create new dataset with self[k] -> self[k][sl].

Parameter sl is a slice object that is applied to every item in the dataset to produce a new gvar.dataset.Dataset. Setting sl = slice(0,None,2), for example, discards every other sample for each quantity in the dataset. Setting sl = slice(100,None) discards the first 100 samples for each quantity.

If parameter sl is a tuple of slice objects, these are applied to successive indices of self[k]. An exception is raised if the number of slice objects exceeds the number of dimensions of any self[k].
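
A minimal sketch using the 'datafile' example from above:

dset = Dataset('datafile')
every_other = dset.slice(slice(0, None, 2))     # keep samples 0 and 2 for every key
drop_first = dset.slice(slice(1, None))         # drop the first sample for every key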

arrayzip(template)

Merge lists of random data according to template.

template is an array of keys in the dataset, where the shapes of self[k] are the same for all keys k in template. self.arrayzip(template) merges the lists of random numbers/arrays associated with these keys to create a new list of (merged) random arrays whose layout is specified by template: for example,

>>> d = Dataset()
>>> d.append(a=1,b=10)
>>> d.append(a=2,b=20)
>>> d.append(a=3,b=30)
>>> print(d)            # three random samples each for a and b
{'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]}
>>> # merge into list of 2-vectors:
>>> print(d.arrayzip(['a','b']))
[[  1.  10.]
 [  2.  20.]
 [  3.  30.]]
>>> # merge into list of (symmetric) 2x2 matrices:
>>> print(d.arrayzip([['b','a'],['a','b']]))
[[[ 10.   1.]
  [  1.  10.]]

 [[ 20.   2.]
  [  2.  20.]]

 [[ 30.   3.]
  [  3.  30.]]]

The number of samples in each merged result is the same as the number of samples for each key (here 3). The keys used in this example represent scalar quantities; in general, they could be either scalars or arrays (of any shape, so long as all have the same shape).

trim()

Create new dataset where all entries have same sample size.
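
A sketch (the fifth sample for s is invented for illustration):

dset = Dataset()
dset.extend(s=[1.1, 0.8, 0.95, 0.91, 1.05],
            v=[[12.2, 20.6], [14.1, 19.2], [10.3, 19.7], [8.2, 21.0]])
trimmed = dset.trim()   # every key in trimmed now has dset.samplesize (= 4) samples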

toarray()

Create new dictionary d where d[k]=numpy.array(self[k]) for all k.