Data input formats

Contents

  • 1  Pre-requirements

    • 1.1  Import dependencies

    • 1.2  Notebook configuration

  • 2  Overview

  • 3  Points

    • 3.1  2D NumPy array of shape (n, d)

  • 4  Distances

    • 4.1  2D NumPy array of shape (n, n)

  • 5  Neighbourhoods

  • 6  Density graph

Pre-requirements

Import dependencies

[1]:
import sys

import matplotlib as mpl
import numpy as np

import cnnclustering.cnn as cnn  # CNN clustering
[2]:
# Version information
print(sys.version)
3.8.3 (default, May 15 2020, 15:24:35)
[GCC 8.3.0]

Notebook configuration

[3]:
# Matplotlib configuration
mpl.rc_file(
    "matplotlibrc",
    use_default_template=False
)
[3]:
# Axis property defaults for the plots
ax_props = {
    "xlabel": None,
    "ylabel": None,
    "xlim": (-2.5, 2.5),
    "ylim": (-2.5, 2.5),
    "xticks": (),
    "yticks": (),
    "aspect": "equal"
}

# Line plot property defaults
line_props = {
    "linewidth": 0,
    "marker": '.',
}

Overview

A data set of \(n\) points can primarily be represented through point coordinates in a \(d\)-dimensional space, or in terms of a pairwise distance matrix (of arbitrary metric). Secondarily, the data set can be described by neighbourhoods (in a graph structure) with respect to a specific radius cutoff. Furthermore, it is possible to trim the neighbourhoods into a density graph that contains, for each point, its density-connected points rather than its neighbours. The memory demand of these input forms and the speed at which they can be clustered vary. Currently, the cnnclustering.cnn module can deal with the following data structures (\(n\): number of points, \(d\): number of dimensions).

Points

  • 2D NumPy array of shape (n, d), holding point coordinates

Distances

  • 2D NumPy array of shape (n, n), holding pairwise distances

Neighbourhoods

  • 1D NumPy array of shape (n,) of 1D NumPy arrays of shape (<= n,), holding point indices

  • Python list of length (n) of Python sets of length (<= n), holding point indices

  • Sparse graph with 1D NumPy array of shape (<= ), holding point indices, and 1D NumPy array of shape (n,), holding neighbourhood start indices

Density graph

  • 1D NumPy array of shape (n,) of 1D NumPy arrays of shape (<= n,), holding point indices

  • Python list of length (n) of Python sets of length (<= n), holding point indices

  • Sparse graph with 1D NumPy array of shape (<= ), holding point indices, and 1D NumPy array of shape (n,), holding connectivity start indices

The different input structures are wrapped by corresponding classes to be handled as attributes of a CNN cluster object. Different kinds of input formats corresponding to the same data set are bundled in a Data object.
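
To make the relation between these representations concrete, the following is a minimal sketch in plain NumPy. It does not use the cnnclustering API. The coordinates, the radius value, the choice to exclude a point from its own neighbourhood, and the names `indices` and `starts` are assumptions made for illustration and need not match the layout the library uses internally.

```python
import numpy as np

# Four points on a line; the coordinates are made up for illustration
points = np.array([[0.0], [0.5], [1.0], [3.0]])
radius = 0.6  # hypothetical radius cutoff

# Dense pairwise distance matrix of shape (n, n), Euclidean metric
diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
distances = np.sqrt((diff ** 2).sum(axis=-1))

# Neighbourhoods as a Python list of sets
# (assumption: a point is not counted as its own neighbour)
n = len(points)
neighbourhoods = [
    {j for j in range(n) if j != i and distances[i, j] <= radius}
    for i in range(n)
]

# The same information as a sparse graph: all neighbour indices
# concatenated, plus the position where each neighbourhood starts
indices = np.array(
    [j for nbrs in neighbourhoods for j in sorted(nbrs)], dtype=np.intp
)
lengths = [len(nbrs) for nbrs in neighbourhoods]
starts = np.cumsum([0] + lengths[:-1])
```

The dense distance matrix always needs \(n^2\) entries, whereas the sparse form only stores as many indices as there are neighbour relations. This is one reason why the memory demand differs between the input forms.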

Points

2D NumPy array of shape (n, d)

The cnn module provides the class Points to handle data set point coordinates. Instances of type Points behave essentially like NumPy arrays.

[19]:
points = cnn.Points()
print("Representation of points: ", repr(points))
print("Points are Numpy arrays:  ", isinstance(points, np.ndarray))
Representation of points:  Points([], dtype=float64)
Points are Numpy arrays:   True

If you have your data points already in the format of a 2D NumPy array, the conversion into Points is straightforward and does not require any copying. Note that the dtype of Points is for now fixed to np.float_.

[42]:
original_points = np.array([[0, 0, 0],
                            [1, 1, 1]], dtype=np.float_)
points = cnn.Points(original_points)
points[0, 0] = 1
points
[42]:
Points([[1., 0., 0.],
        [1., 1., 1.]])
[43]:
original_points
[43]:
array([[1., 0., 0.],
       [1., 1., 1.]])
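
The copy-free behaviour shown above is what NumPy offers for array subclasses created as views. The sketch below uses a hypothetical stand-in class `PointsLike`, which is not the library's Points implementation, to show how shared memory can be checked with `np.shares_memory`.

```python
import numpy as np


class PointsLike(np.ndarray):
    """Hypothetical array subclass for illustration; not cnn.Points."""

    def __new__(cls, data):
        # np.asarray returns the input itself if it is already a
        # float64 array, so the view below shares its memory
        return np.asarray(data, dtype=np.float64).view(cls)


original = np.zeros((2, 3), dtype=np.float64)
wrapped = PointsLike(original)
wrapped[0, 0] = 1.0  # also changes `original`

shares = np.shares_memory(original, wrapped)

# A nested list has no array buffer to share, so a new array is allocated
from_list = PointsLike([[0, 0, 0], [1, 1, 1]])
```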

1D sequences are interpreted as a single point on initialisation.

[45]:
points = cnn.Points(np.array([0, 0, 0]))
points
[45]:
Points([[0., 0., 0.]])
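
The same promotion of a 1D sequence to a single row can be reproduced in plain NumPy with `np.atleast_2d`. Whether Points uses this function internally is an assumption; the sketch only illustrates the resulting shape.

```python
import numpy as np

single = np.array([0, 0, 0], dtype=np.float64)  # 1D input of shape (3,)
as_points = np.atleast_2d(single)               # one point with 3 dimensions
shape = as_points.shape
```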

Other sequences, such as lists, work as input too, but note that this requires a copy.

[47]:
original_points = [[0, 0, 0],
                   [1, 1, 1]]
points = cnn.Points(original_points)
points
[47]:
Points([[0., 0., 0.],
        [1., 1., 1.]])

Points can be used to represent data sets distributed over multiple parts. Parts could be independent measurements that should be clustered together but remain separated for later analyses. Internally, Points always stores the underlying point coordinates as a (vertically stacked) 2D array. Points.edges is used to track the number of points belonging to each part. The alternative constructor Points.from_parts can be used to deduce edges from parts of points passed as a sequence of 2D sequences.

[64]:
points = cnn.Points.from_parts([[[0, 0, 0],
                                 [1, 1, 1]],
                                [[2, 2, 2],
                                 [3, 3, 3]]])
points
[64]:
Points([[0., 0., 0.],
        [1., 1., 1.],
        [2., 2., 2.],
        [3., 3., 3.]])
[65]:
points.edges  # 2 parts, 2 points each
[65]:
array([2, 2])

Trying to set edges manually to a sequence that is not consistent with the total number of points will raise an error. Setting the edges of an empty Points object is, however, allowed and can be used to store part information even when no points are loaded.

[66]:
points.edges = [2, 3]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-4bd144cf309c> in <module>
----> 1 points.edges = [2, 3]

~/CNN/cnnclustering/cnn.py in edges(self, x)
    810
    811         if (n != 0) and (sum_edges != n):
--> 812             raise ValueError(
    813                 f"Part edges ({sum_edges} points) do not match data points "
    814                 f"({n} points)"

ValueError: Part edges (5 points) do not match data points (4 points)

Points.by_parts can be used to retrieve the parts again one by one.

[70]:
for part in points.by_parts():
    print(f"{part} \n")
[[0. 0. 0.]
 [1. 1. 1.]]

[[2. 2. 2.]
 [3. 3. 3.]]
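
The bookkeeping behind edges can be sketched in plain NumPy: the stacked array is cut at the cumulative sums of the part lengths. This is an illustration under the assumption that edges holds the number of points per part, as described above; it is not the library's implementation of by_parts.

```python
import numpy as np

stacked = np.array([[0., 0., 0.],
                    [1., 1., 1.],
                    [2., 2., 2.],
                    [3., 3., 3.]])
edges = np.array([2, 2])  # 2 parts, 2 points each

# Same consistency check as described for the edges setter
if edges.sum() != len(stacked):
    raise ValueError("Part edges do not match data points")

# Cut positions are the cumulative part lengths without the last one
split_at = np.cumsum(edges)[:-1]
parts = np.split(stacked, split_at)
```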

To provide one possible way to calculate neighbourhoods from points, Points has a thin method wrapper for scipy.spatial.cKDTree. This will set Points.tree which is used by CNN.calc_neighbours_from_cKDTree. The user is encouraged to use any other external method instead.

[75]:
points.cKDTree()
points.tree
[75]:
<scipy.spatial.ckdtree.cKDTree at 0x7f0f6d3f3900>
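
For readers who prefer to compute neighbourhoods outside of the CNN object, the following sketch queries a `scipy.spatial.cKDTree` directly with `query_ball_point`. The coordinates and the radius are made up, and removing each point from its own neighbourhood is an assumption about the desired convention.

```python
import numpy as np
from scipy.spatial import cKDTree

points = np.array([[0.0, 0.0],
                   [0.5, 0.0],
                   [1.0, 0.0],
                   [3.0, 0.0]])
radius = 0.6  # hypothetical radius cutoff

tree = cKDTree(points)

# For every point, the indices of all points within `radius`;
# each result includes the query point itself
raw = tree.query_ball_point(points, r=radius)

# Convert to a list of sets and drop the self-reference
neighbourhoods = [set(nbrs) - {i} for i, nbrs in enumerate(raw)]
```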

Distances

2D NumPy array of shape (n, n)

The cnn module provides the class Distances to handle data set pairwise distances as a dense matrix. Like Points, instances of type Distances behave much like NumPy arrays.

[79]:
distances = cnn.Distances([[0, 1], [1, 0]])
distances
[79]:
Distances([[0., 1.],
           [1., 0.]])

Distances do not support an edges attribute, i.e. they cannot represent part information. Use the edges of an associated Points instance instead.

Pairwise Distances can be calculated for \(n\) points within a data set from a Points instance, for example with CNN.calc_dist, resulting in a matrix of shape (\(n\), \(n\)). They can also be calculated between \(n\) points in one and \(m\) points in another data set, resulting in a relative distance matrix (map matrix) of shape (\(n\), \(m\)). In the latter case, Distances.reference should be used to keep track of the CNN object carrying the second data set. Such a map matrix can be used to predict cluster labels for a data set based on the fitted cluster labels of another set.
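
The difference between a square distance matrix and a map matrix can be sketched with NumPy broadcasting. The helper `pairwise_distances`, the coordinates and the labels are hypothetical. The prediction step is a deliberately simplified nearest-point assignment, used here only to show how a map matrix is read, and not the prediction criterion applied by the library.

```python
import numpy as np


def pairwise_distances(a, b):
    """Euclidean distances between rows of a (n, d) and rows of b (m, d)."""
    diff = a[:, np.newaxis, :] - b[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Data set with fitted cluster labels (made up for illustration)
fitted = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
fitted_labels = np.array([1, 1, 2])

# Second data set for which labels should be predicted
new = np.array([[0.2, 0.1], [4.5, 5.0]])

square = pairwise_distances(fitted, fitted)  # within one data set
map_matrix = pairwise_distances(new, fitted)  # between two data sets

# Simplified stand-in for prediction: take the label of the closest
# fitted point for each row of the map matrix
predicted = fitted_labels[map_matrix.argmin(axis=1)]
```

Each row of the map matrix belongs to a point of the new data set and each column to a point of the fitted data set, which is why the reference to the fitted data has to be kept alongside the matrix.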

Neighbourhoods

Density graph