cnn - A Python module for common-nearest-neighbour (CNN) clustering


CNN class API

The functionality of this module is primarily exposed and bundled by the cnnclustering.cnn.CNN class. For hierarchical clusterings cnnclustering.cnn.CNNChild is used, too.

class cnnclustering.cnn.CNN(points: Optional[Any] = None, distances: Optional[Any] = None, neighbourhoods: Optional[Any] = None, graph: Optional[Any] = None, labels: Collection[int] = None, alias: str = 'root')

CNN cluster object

A cluster object connects input data (points, distances, neighbours, or density graphs) to cluster results (labels) and clustering methodologies (fits). It also interfaces several convenience functions.

Parameters
  • points – Argument passed on to Data to construct attribute data.

  • distances – Argument passed on to Data to construct attribute data.

  • neighbourhoods – Argument passed on to Data to construct attribute data.

  • graph – Argument passed on to Data to construct attribute data.

  • labels – Argument passed on to Labels to construct attribute labels.

  • alias – Descriptive object identifier.

data

An instance of Data.

labels

An instance of Labels.

alias

Descriptive object identifier.

hierarchy_level

Level in the cluster hierarchy of the cluster object.

summary

An instance of Summary collecting recorded cluster results.

status

Dictionary summarising the current state of the cluster object.

children

Dictionary of child cluster objects (of type CNNChild). Created from cluster label assignments by isolate().

calc_dist(other: Optional[Type[CNN]] = None, v: bool = True, method: str = 'cdist', mmap: bool = False, mmap_file: Optional[Union[pathlib.Path, str, IO[bytes]]] = None, chunksize: int = 10000, progress: bool = True, **kwargs)

Compute a distance matrix

Requires Data.points, computes distances and sets Data.distances.

Note: Currently only works for point objects of type Points and distances of type Distances.

Parameters
  • other – If not None, a second CNN cluster object. Distances are calculated between n points in self and m points in other. If None, distances are calculated within self.

  • v – Be chatty.

  • method

    Method to compute distances with:

  • mmap – Whether to memory map the calculated distances on disk with NumPy.

  • mmap_file – If mmap is set to True, where to store the file. If None, uses a temporary file.

  • chunksize – Portions of data to process at once. Can be used to keep memory consumption low. Only useful together with mmap.

  • progress – Whether to show a progress bar.

  • **kwargs – Pass on to whatever is used as method.

Returns

None

Raises

ValueError – If method not known.
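The distance computation described above fills an (n, m) matrix, where m = n if no second data set is given. The following is a minimal pure-Python sketch of that idea, not the module's implementation; the helper name `pairwise_distances` is hypothetical and the Euclidean metric is an assumption.

```python
import math
from typing import List, Optional, Sequence

Point = Sequence[float]


def pairwise_distances(
        points: Sequence[Point],
        other: Optional[Sequence[Point]] = None) -> List[List[float]]:
    """Return Euclidean distances between `points` (rows) and `other` (columns).

    Hypothetical helper for illustration only. If `other` is None,
    distances are computed within `points`.
    """
    if other is None:
        other = points
    return [[math.dist(p, q) for q in other] for p in points]


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# Square (n, n) matrix of distances within one data set
distances = pairwise_distances(points)

# Rectangular (n, m) matrix relative to a second data set
relative = pairwise_distances(points, [(0.0, 0.0)])
```

Holding the full matrix in memory is what makes this input costly for large n, which is the situation the mmap and chunksize options address.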

calc_neighbours_from_cKDTree(r: float, other: Optional[Type[CNN]] = None, format='array_arrays', **kwargs)

Calculate neighbourhoods from tree structure

Requires Data.points.tree, computes neighbourhoods using scipy.spatial.cKDTree.query_ball_tree and sets Data.neighbourhoods. See also Points.cKDTree() to build a suitable tree structure from data points.

Parameters
  • r – Search query radius.

  • other – If not None, another CNN instance whose data points should be used for a relative neighbour search. Also requires other.data.points.tree.

  • format

    Output format for the created neighbourhoods:

  • **kwargs – Keyword args passed on to scipy.spatial.cKDTree.query_ball_tree

calc_neighbours_from_dist(r: float, format='array_arrays')

Calculate neighbourhoods from distances

Requires Data.distances, computes neighbourhoods and sets Data.neighbourhoods.

Note: Currently only works for distance objects of type Distances.

Parameters
  • r – Search query radius.

  • format

    Output format for the created neighbourhoods:

Returns

None
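To illustrate what computing neighbourhoods from distances means, here is a minimal sketch in plain Python. It is not the module's code: the function name `neighbourhoods_from_distances` is hypothetical, and whether the radius comparison is inclusive (`<=`) is an assumption.

```python
from typing import List, Sequence, Set


def neighbourhoods_from_distances(
        distances: Sequence[Sequence[float]],
        r: float,
        self_counting: bool = False) -> List[Set[int]]:
    """Collect for each point the indices of all points within radius r.

    Hypothetical helper. Assumes a square matrix of distances within
    one data set, so that row index i and column index i are the same point.
    """
    neighbourhoods = []
    for i, row in enumerate(distances):
        # Assumption: points at exactly distance r count as neighbours
        neighbours = {j for j, d in enumerate(row) if d <= r}
        if not self_counting:
            neighbours.discard(i)
        neighbourhoods.append(neighbours)
    return neighbourhoods


# Four points on a line at positions 0, 1, 2 and 10
distances = [
    [0, 1, 2, 10],
    [1, 0, 1, 9],
    [2, 1, 0, 8],
    [10, 9, 8, 0],
]
neighbourhoods = neighbourhoods_from_distances(distances, r=1.5)
```

The isolated point at position 10 ends up with an empty neighbourhood, so it can never satisfy a common-nearest-neighbour criterion.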

check()

Check current data state

Check if data points, distances, neighbourhoods or a density graph are present. Check depends on length of the stored objects. An empty data structure (length = 0) represents no data. Sets status.

Returns

None

cut(part: Optional[int] = None, points: Tuple[Optional[int], …] = None, dimensions: Tuple[Optional[int], …] = None)

Create a new CNN instance from a data subset

Convenience function to create a reduced cluster object. Supported are continuous slices from the original data that allow making a view instead of a copy.

Parameters
  • part – Cut out the points for exactly one part (zero based index).

  • points – Slice points by using (start:stop:step)

  • dimensions – Slice dimensions by using (start:stop:step)

Returns

CNN
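The points and dimensions arguments are slice specifications given as tuples. A small sketch of how such tuples can be turned into slices, using nested lists for illustration; the helper `cut_rows` is hypothetical and not part of the module. Note that slicing Python lists copies data, whereas basic slicing of a NumPy array yields a view, which is the behaviour the description above refers to.

```python
def cut_rows(data, points=None, dimensions=None):
    """Apply (start, stop, step) tuples to rows (points) and columns (dimensions).

    Hypothetical helper for illustration only.
    """
    row_slice = slice(*points) if points is not None else slice(None)
    col_slice = slice(*dimensions) if dimensions is not None else slice(None)
    return [row[col_slice] for row in data[row_slice]]


data = [
    [0, 1, 2],
    [3, 4, 5],
    [6, 7, 8],
    [9, 10, 11],
]

# Every second point, first two dimensions only
reduced = cut_rows(data, points=(None, None, 2), dimensions=(0, 2))
```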

dist_hist(ax: Optional[Type[matplotlib.axes._subplots.SubplotBase]] = None, maxima: bool = False, maxima_props: Optional[Dict[str, Any]] = None, hist_props: Optional[Dict[str, Any]] = None, ax_props: Optional[Dict[str, Any]] = None, inter_props: Optional[Dict[str, Any]] = None)

Plot a histogram of distances in the data set

Requires data.distances.

Parameters
  • ax – Matplotlib Axes to plot on. If None, Figure and Axes are created.

  • maxima – Whether to mark the maxima of the distribution. Uses scipy.signal.argrelextrema.

  • maxima_props – Keyword arguments passed to scipy.signal.argrelextrema if maxima is set to True.

  • hist_props – Keyword arguments passed to numpy.histogram to compute the histogram.

  • ax_props – Keyword arguments for Matplotlib Axes styling.

evaluate(ax: Optional[Type[matplotlib.axes._subplots.SubplotBase]] = None, clusters: Optional[Collection[int]] = None, original: bool = False, plot: str = 'dots', points: Optional[Tuple[Optional[int]]] = None, dim: Optional[Tuple[int, int]] = None, ax_props: Optional[Dict] = None, annotate: bool = True, annotate_pos: str = 'mean', annotate_props: Optional[Dict] = None, plot_props: Optional[Dict] = None, plot_noise_props: Optional[Dict] = None, hist_props: Optional[Dict] = None, free_energy: bool = True)

Returns a 2D plot of an original data set or a cluster result

Parameters
  • ax – The Axes instance to which to add the plot. If None, a new Figure with Axes will be created.

  • clusters – Cluster numbers to include in the plot. If None, consider all.

  • original – Whether to plot the original data instead of a cluster result. Overrides clusters. Will be considered True if no cluster result is present.

  • plot

    The kind of plotting method to use:

    • “dots”: ax.plot()

    • “scatter”: ax.scatter()

    • “contour”: ax.contour()

    • “contourf”: ax.contourf()

  • parts – Use a slice (start, stop, stride) on the data parts before plotting.

  • points – Use a slice (start, stop, stride) on the data points before plotting.

  • dim – Use these two dimensions for plotting. If None, uses (0, 1).

  • annotate – If there is a cluster result, plot the cluster numbers. Uses annotate_pos to determine the position of the annotations.

  • annotate_pos

    Where to put the cluster number annotation. Can be one of:

    • “mean”: Use the cluster mean

    • “random”: Use a random point of the cluster

    Alternatively, a list of x, y positions can be passed to set a specific point for each cluster (not yet implemented).

  • annotate_props – Dictionary of keyword arguments passed to ax.annotate().

  • ax_props – Dictionary of ax properties to apply after plotting via ax.set(**ax_props). If None, uses defaults that can also be defined in the configuration file (not yet implemented).

  • plot_props – Dictionary of keyword arguments passed to various functions (_plots.plot_dots() etc.) with different meaning to format cluster plotting. If None, uses defaults that can also be defined in the configuration file (not yet implemented).

  • plot_noise_props – Like plot_props but for formatting noise point plotting.

  • hist_props – Dictionary of keyword arguments passed to functions that involve the computing of a histogram via numpy.histogram2d.

  • free_energy – If True, converts computed histograms to pseudo free energy surfaces.

  • mask – Sequence of boolean or integer values used for optional fancy indexing on the point data array. Note that this is applied after regular slicing (e.g. via points) and requires a copy of the indexed data (may be slow and memory intensive for big data sets).

Returns

Figure, Axes and a list of plotted elements

fit(*args, **kwargs) → Optional[Tuple[cnnclustering.cnn.CNNRecord, bool]]

Wraps CNN clustering execution

Requires one of Data.graph, Data.neighbourhoods, Data.distances, or Data.points and sets labels.

This function prepares the clustering and calls an appropriate worker function to do the actual clustering. How the clustering is done, depends on the current data situation and the selected policy. The clustering can be done with different inputs:

The input structures differ in their memory demands. Storing distances in particular can be costly (memory complexity \(\mathcal{O}(n^2)\)). Ultimately, the clustering depends on neighbourhood or density graph information. The clustering is fast if neighbourhoods or a density graph are pre-computed, but this has to be redone for each radius_cutoff and/or cnn_cutoff separately. Neighbourhoods can be calculated either from data points (e.g. calc_neighbours_from_cKDTree()) or from pre-computed pairwise distances (e.g. calc_neighbours_from_dist()). The user is encouraged to apply any external method of choice to provide neighbourhoods in a format that can be clustered.

If only a primary input structure (points or distances) is given, neighbourhoods will be computed from it. With policy = “progressive”, neighbourhoods are computed automatically before the clustering, from distances if present and otherwise from points. With policy = “conservative”, neighbourhoods are computed on-the-fly (online) during the clustering, again from distances if present and otherwise from points. This can save memory but can be computationally more expensive. Caching can be used to achieve the right balance between memory usage and computing effort for your situation.
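The trade-off between the two policies can be sketched in plain Python. This is only an illustration of bulk versus on-demand neighbourhood computation with a bounded cache, not the module's implementation; the names `neighbours_of` and `cached_neighbours_of`, the sample points, and the use of `functools.lru_cache` are all assumptions made for this example.

```python
import math
from functools import lru_cache

points = ((0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 5.0))
radius_cutoff = 0.6


def neighbours_of(i: int) -> frozenset:
    """Neighbours of point i within radius_cutoff, without self-counting."""
    return frozenset(
        j for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= radius_cutoff
    )


# "progressive"-like: compute and store all neighbourhoods up front
# (fast look-ups afterwards, but everything is held in memory)
bulk = [neighbours_of(i) for i in range(len(points))]


# "conservative"-like: compute a neighbourhood only when it is needed;
# a bounded cache keeps the most recently used ones around
@lru_cache(maxsize=2)
def cached_neighbours_of(i: int) -> frozenset:
    return neighbours_of(i)


# Common neighbours of points 0 and 2, computed on demand
shared = cached_neighbours_of(0) & cached_neighbours_of(2)
```

A larger cache moves the on-demand variant closer to the bulk behaviour, while a smaller one lowers memory use at the price of recomputation.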

Parameters
  • radius_cutoff – Radius cutoff cluster parameter.

  • cnn_cutoff – CNN cutoff cluster parameter (similarity criterion).

  • member_cutoff – Valid clusters need to have at least this many members. Passed to Labels.sort_by_size() if sort_by_size is True. Has no effect otherwise and valid clusters have at least two members.

  • max_clusters – Keep only the largest max_clusters clusters. Passed to Labels.sort_by_size() if sort_by_size is True. Has no effect otherwise.

  • cnn_offset – Exists for compatibility reasons and is subtracted from cnn_cutoff. If cnn_offset = 0, two points need to share at least cnn_cutoff neighbours to be part of the same cluster without counting any of the two points. In former versions of the clustering, self-counting was included and cnn_cutoff = 2 was equivalent to cnn_cutoff = 0 in this version.

  • info – Whether to attach LabelInfo to created Labels instance.

  • sort_by_size – Whether to sort (and trim) the created Labels instance. See also Labels.sort_by_size().

  • rec – Whether to create and return a CNNRecord instance in the end.

  • v – Be chatty.

  • policy

    Determines the computation behaviour depending on the given data situation:

    • “progressive”: Bulk-compute neighbourhoods if needed.

    • “conservative”: Compute neighbourhoods on-the-fly if needed.

Returns

Tuple(CNNRecord, v) if rec is True, None otherwise

Raises

AssertionError – If policy is not in [“progressive”, “conservative”]
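The similarity criterion behind cnn_cutoff and cnn_offset, as described for the parameters above, can be written down as a pairwise check. This sketch covers only that check on pre-computed neighbourhoods; it does not reproduce how the module grows clusters from it, and the function name `share_enough_neighbours` is hypothetical.

```python
from typing import Sequence, Set


def share_enough_neighbours(
        i: int,
        j: int,
        neighbourhoods: Sequence[Set[int]],
        cnn_cutoff: int,
        cnn_offset: int = 0) -> bool:
    """Check the common-nearest-neighbour criterion for points i and j.

    Hypothetical helper. The two points themselves are excluded from
    the count, and cnn_offset is subtracted from cnn_cutoff.
    """
    common = (neighbourhoods[i] & neighbourhoods[j]) - {i, j}
    return len(common) >= cnn_cutoff - cnn_offset


# Points 0 to 3 are mutual neighbours, point 4 is isolated
neighbourhoods = [
    {1, 2, 3},
    {0, 2, 3},
    {0, 1, 3},
    {0, 1, 2},
    set(),
]

similar = share_enough_neighbours(0, 1, neighbourhoods, cnn_cutoff=2)
too_strict = share_enough_neighbours(0, 1, neighbourhoods, cnn_cutoff=3)
with_offset = share_enough_neighbours(0, 1, neighbourhoods, cnn_cutoff=3, cnn_offset=1)
```

Points 0 and 1 share the two neighbours 2 and 3, so they pass for cnn_cutoff = 2, fail for cnn_cutoff = 3, and pass again for cnn_cutoff = 3 once an offset of 1 is applied.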

get_dtraj()

Transform cluster labels to discrete trajectories

Returns

[description]

Return type

[type]

get_samples(kind: str = 'mean', clusters: Optional[List[int]] = None, n_samples: int = 1, by_parts: bool = True, skip: int = 0, stride: int = 1) → Dict[int, List[int]]

Select sample points from clusters

Parameters
  • kind

    How to choose the samples:

    • “mean”:

    • “random”:

    • “all”:

  • clusters – List of cluster numbers to consider

  • n_samples – How many samples to return

  • by_parts – Return point indices as list of lists by parts

  • skip – Skip the first n frames

  • stride – Take only every n-th frame

Returns

Dictionary of sample point indices as list for each cluster

isolate(purge: bool = True) → None

Isolates points per cluster based on a cluster result

predict(**kwargs) → None

Wraps CNN cluster prediction execution

Predict labels for points in a data set (other) on the basis of assigned labels to a “train” set (self).

Parameters
  • other – CNN cluster object for whose points cluster labels should be predicted.

  • radius_cutoff – Find nearest neighbours within distance r.

  • cnn_cutoff – Points of the same cluster must have at least c common nearest neighbours (similarity criterion).

  • include_all – If False, keep cluster assignment for points in the test set that have a maximum distance of same_tol to a point in the train set, i.e. they are essentially the same point (currently not implemented).

  • same_tol – Distance cutoff to treat points as the same, if include_all is False.

  • clusters – Predict assignment of points only with respect to this list of clusters.

  • purge – If True, reinitialise predicted labels. Override assignment memory.

  • cnn_offset – Mainly for backwards compatibility. Modifies the cnn_cutoff.

  • policy

    Determines the computation behaviour depending on the given data situation:

    • “progressive”: Bulk-compute neighbourhoods if needed.

    • “conservative”: Compute neighbourhoods on-the-fly if needed.

Returns

None

reel(deep: Optional[int] = 1) → None

Wrap up assignments of lower hierarchy levels

Parameters

deep – How many lower levels to consider. If None, consider all.


class cnnclustering.cnn.CNNChild(parent, *args, alias='child', **kwargs)

CNN cluster object subclass.

Increments the hierarchy level of the parent object when instantiated.

parent

Reference to parent


Data input formats

Input data of differing nature and format (points, distances, neighbourhoods, density graphs) are organised and bundled by the cnnclustering.cnn.Data class.

class cnnclustering.cnn.Data(points=None, distances=None, neighbourhoods=None, graph=None)

Abstraction class for handling input data

A data object bundles points, distances, neighbourhoods and density graphs.

Parameters
  • points – Used to set points. If None, creates an empty instance of Points. If an instance of Points is passed, uses this instance. If anything else is passed, tries to create an instance of Points using Points.from_parts().

  • distances – Used to set distances. If None, creates an empty instance of Distances. If an instance of Distances is passed, uses this instance. If anything else is passed, tries to create an instance of Distances using the default constructor.

  • neighbourhoods – Used to set neighbourhoods. If None, creates an empty instance of NeighbourhoodsArray. If an instance of a class qualifying as neighbourhoods object (see NeighbourhoodsABC) is passed, uses this instance. If anything else is passed, tries to create an instance of NeighbourhoodsArray using the default constructor.

  • graph – Used to set graph. If None, creates an empty instance of DensitySparsegraphArray. If an instance of a class qualifying as density graph object (see DensitygraphABC) is passed, uses this instance. If anything else is passed, tries to create an instance of DensitySparsegraphArray using the default constructor.

shape

Dictionary summarising size of data structures.

points

Instance of Points.

distances

Instance of Distances.

neighbourhoods

Instance of subclass of NeighbourhoodsABC.

graph

Instance of subclass of DensitygraphABC.


Points

Points are currently supported in the form of a 2D NumPy array of shape (n: points, d: dimensions) through the cnnclustering.cnn.Points class.

class cnnclustering.cnn.Points(p: Optional[numpy.ndarray] = None, edges: Optional[Sequence] = None, tree: Optional[Any] = None)

Abstraction class for data points

by_parts() → Iterator

Yield data by parts

Returns

Generator of 2D numpy.ndarray objects (parts)

cKDTree(**kwargs)

Wrapper for scipy.spatial.cKDTree()

Sets Points.tree.

Parameters

**kwargs – Passed to scipy.spatial.cKDTree()

classmethod from_file(f: Union[str, pathlib.Path], *args, from_parts: bool = False, **kwargs)

Alternative constructor

Load file content to be interpreted as points. Uses load() to read files.

Recognised input formats are:
  • Sequence

  • 2D Sequence (sequence of sequences all of same length)

  • Sequence of 2D sequences all of same second dimension

Parameters
  • f – File name as string or pathlib.Path object.

  • *args – Arguments passed to load().

  • from_parts – If True uses from_parts() constructor. If False uses default constructor.

Returns

Instance of Points

classmethod from_parts(p: Optional[Sequence])

Alternative constructor

Use if data is passed as collection of parts, as

>>> p = Points.from_parts([[[0, 0], [1, 1]],
...                        [[2, 2], [3, 3]]])
>>> p
Points([[0, 0],
        [1, 1],
        [2, 2],
        [3, 3]])

Recognised input formats are:
  • Sequence

  • 2D Sequence (sequence of sequences all of same length)

  • Sequence of 2D sequences all of same second dimension

In this way, part edges are taken from the input shape and do not have to be specified explicitly. Calls get_shape().

Parameters

p – Data points in one of the recognised input formats listed above.

Returns

Instance of Points

static get_shape(data: Any)

Maintain data in universal shape (2D NumPy array)

Analyses the format of given data and fits it into the standard format (parts, points, dimensions). Creates a numpy.ndarray vstacked along the parts component that can be passed to the Points constructor alongside part edges. This may not be able to deal with all possible kinds of input structures correctly, so check the outcome carefully.

Recognised input formats are:
  • Sequence

  • 2D Sequence (sequence of sequences all of same length)

  • Sequence of 2D sequences all of same second dimension

Parameters

data

Either None or:

  • a 1D sequence of length d, interpreted as 1 point in d dimension

  • a 2D sequence of length n (rows) and width d (columns), interpreted as n points in d dimensions

  • a list of 2D sequences, interpreted as groups (parts) of points

Returns

Tuple of

  • NumPy array of shape (\(\sum n, d\))

  • Part edges list, marking the end points of the parts
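As an illustration of the stacking described above, the following sketch flattens a list of parts into one list of points and keeps track of where each part ends. It uses plain lists instead of NumPy arrays and is not the module's code. The description "marking the end points of the parts" could mean either part lengths or cumulative end indices; the sketch computes both, and which one the library actually stores is not confirmed here. The helper name `stack_parts` is hypothetical.

```python
from itertools import accumulate


def stack_parts(parts):
    """Stack a list of 2D parts into one 2D list and record part lengths.

    Hypothetical helper for illustration only.
    """
    stacked = [list(point) for part in parts for point in part]
    lengths = [len(part) for part in parts]
    return stacked, lengths


parts = [
    [[0, 0], [1, 1]],
    [[2, 2], [3, 3]],
]
stacked, lengths = stack_parts(parts)

# Cumulative end indices follow directly from the part lengths
ends = list(accumulate(lengths))

# The original parts can be recovered from the stacked data and the ends
starts = [0] + ends[:-1]
recovered = [stacked[a:b] for a, b in zip(starts, ends)]
```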

static load(f: Union[pathlib.Path, str], *args, **kwargs) → None

Loads file content

Depending on the filename extension, a suitable loader is called:

  • .p: pickle.load()

  • .npy: numpy.load()

  • None: numpy.loadtxt()

  • .xvg, .dat: numpy.loadtxt()

Sets data and shape.

Parameters
  • f – File

  • *args – Passed to loader

Keyword Arguments

**kwargs – Passed to loader

Returns

Return value of the loader


Distances

Distances are currently supported in the form of a 2D NumPy array of shape (n: points, m: points) through the cnnclustering.cnn.Distances class.

class cnnclustering.cnn.Distances(d: Optional[numpy.ndarray] = None, reference=None)

Abstraction class for data point distances


Neighbourhoods

Neighbourhoods are currently supported in the form of:

Valid neighbourhood containers should inherit from the abstract base class cnnclustering.cnn.NeighbourhoodsABC or the general realisation cnnclustering.cnn.Neighbourhoods.

class cnnclustering.cnn.NeighbourhoodsArray(sequence: Optional[Sequence[int]] = None, radius=None, reference=None, self_counting=False)

Array of array realisation of neighbourhood abstraction

Inherits from numpy.ndarray and Neighbourhoods.

property n_neighbours

Return number of neighbours for each point


class cnnclustering.cnn.NeighbourhoodsList(neighbourhoods=None, radius=None, reference=None)

List of sets realisation of neighbourhood abstraction

Inherits from collections.UserList and Neighbourhoods.

property n_neighbours

Return number of neighbours for each point


class cnnclustering.cnn.Neighbourhoods(neighbourhoods=None, radius=None, reference=None, self_counting=False)

Basic realisation of neighbourhood abstraction

Inherits from NeighbourhoodsABC.

Makes no assumptions on the nature of the stored neighbours and provides default implementations for the attributes required by NeighbourhoodsABC. Since working realisations of the Neighbourhoods base class usually inherit with priority from a collection type whose __init__ mechanism is probably used, the alternative method init_finalise is offered to set the required attributes.

property n_neighbours

Return number of neighbours for each point

property radius

Return radius

property reference

Return reference CNN instance

property self_counting

Self-counting of points as their own neighbours?


class cnnclustering.cnn.NeighbourhoodsABC

Abstraction class for neighbourhoods

Neighbourhoods (integer point indices) can be stored in different data structures (non-exhaustive listing):

Collection of collections:

Linear collection plus slice indicator:

  • array of neighbours plus array of starting indices (NeighbourhoodsSparsegraphArray)

  • array in which one element indicates the length and the following elements are neighbours (NeighbourhoodsLinear)

To qualify as a neighbourhoods container, the following attributes should be present in any case:

  • radius: Points are neighbours of each other with respect to this radius (any metric).

  • reference: A CNN instance if neighbourhoods are valid for points in different data sets.

  • n_neighbours: Return the neighbour count for each point in the container.

  • self_counting: Boolean indicator, True if neighbourhoods include self-counting (point is its own neighbour).

  • __str__: A useful str-representation revealing the type and the radius.

abstract property n_neighbours

Return number of neighbours for each point

abstract property radius

Return radius

abstract property reference

Return reference CNN instance

abstract property self_counting

Self-counting of points as their own neighbours?
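A custom neighbourhoods container following the interface listed above might be sketched as follows. The classes `SketchNeighbourhoodsABC` and `SketchNeighbourhoodsList` are hypothetical stand-ins written for this example. They mirror the documented attribute names but are not the module's classes, and a real container would presumably inherit from the module's own base class instead.

```python
from abc import ABC, abstractmethod
from collections import UserList


class SketchNeighbourhoodsABC(ABC):
    """Stand-in for an abstract neighbourhoods interface (illustration only)."""

    @property
    @abstractmethod
    def radius(self): ...

    @property
    @abstractmethod
    def reference(self): ...

    @property
    @abstractmethod
    def n_neighbours(self): ...

    @property
    @abstractmethod
    def self_counting(self): ...


class SketchNeighbourhoodsList(UserList, SketchNeighbourhoodsABC):
    """List-of-sets container providing the required attributes."""

    def __init__(self, neighbourhoods=None, radius=None,
                 reference=None, self_counting=False):
        super().__init__(neighbourhoods)
        self._radius = radius
        self._reference = reference
        self._self_counting = self_counting

    @property
    def radius(self):
        return self._radius

    @property
    def reference(self):
        return self._reference

    @property
    def n_neighbours(self):
        # One neighbour count per stored neighbourhood
        return [len(neighbours) for neighbours in self.data]

    @property
    def self_counting(self):
        return self._self_counting

    def __str__(self):
        return f"{type(self).__name__}, radius = {self._radius}"


container = SketchNeighbourhoodsList([{1}, {0, 2}, {1}], radius=1.5)
counts = container.n_neighbours
```

Because the four properties are declared abstract, instantiating the base class directly raises a TypeError, which makes a missing attribute in a new container visible early.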


Density graphs

Density graphs are currently supported in the form of sparse graphs through the cnnclustering.cnn.SparsegraphArray class.

Valid density graph containers should inherit from the abstract base class cnnclustering.cnn.DensitygraphABC or the general realisation cnnclustering.cnn.Densitygraph.

class cnnclustering.cnn.DensitySparsegraphArray

class cnnclustering.cnn.Densitygraph

class cnnclustering.cnn.DensitygraphABC

Cluster results

Cluster results can be recorded by cnnclustering.cnn.CNN.fit() as cnnclustering.cnn.CNNRecord. Multiple records are collected by cnnclustering.cnn.Summary. Cluster label assignments are stored independently and are currently supported through cnnclustering.cnn.Labels. Labels can be put into context by attaching cnnclustering.cnn.LabelInfo.

class cnnclustering.cnn.CNNRecord(points: int, r: float, c: int, min: int, max: int, clusters: int, largest: float, noise: float, time: float)

Cluster result container

CNNRecord instances can be returned by CNN.fit() and are collected in Summary.

c: int

CNN cutoff c (similarity criterion)

clusters: int

Number of identified clusters.

largest: float

Ratio of points in the largest cluster.

max: int

Maximum cluster number. After sorting, only the biggest max clusters are kept.

min: int

Member cutoff. Valid clusters have at least this many members.

noise: float

Ratio of points classified as outliers.

points: int

Number of points in the clustered data set.

r: float

Radius cutoff r.

time: float

Measured execution time for the fit, including sorting in seconds.
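The record fields listed above can be pictured as a named tuple. The class `RecordSketch` below is a hypothetical stand-in that copies the documented field names and types; it is not the module's CNNRecord, and the values are made up purely to show how the fields relate.

```python
from typing import NamedTuple


class RecordSketch(NamedTuple):
    """Stand-in for a cluster result record (illustration only)."""
    points: int      # number of points in the clustered data set
    r: float         # radius cutoff
    c: int           # CNN cutoff (similarity criterion)
    min: int         # member cutoff
    max: int         # maximum cluster number
    clusters: int    # number of identified clusters
    largest: float   # ratio of points in the largest cluster
    noise: float     # ratio of points classified as outliers
    time: float      # execution time in seconds


# Made-up example values
record = RecordSketch(
    points=1000, r=0.5, c=2, min=10, max=5,
    clusters=3, largest=0.6, noise=0.1, time=0.0,
)

# Named tuples are immutable; a field is changed by creating a copy
timed_record = record._replace(time=0.12)
```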


class cnnclustering.cnn.Summary(iterable=None)

List like container for cluster results

Stores instances of CNNRecord.

insert(index, item)

S.insert(index, value) – insert value before index

summarize(ax: Optional[Type[matplotlib.axes._subplots.SubplotBase]] = None, quant: str = 'time', treat_nan: Optional[Any] = None, ax_props: Optional[Dict] = None, contour_props: Optional[Dict] = None)

Generate a 2D plot of record values

Record values (“time”, “clusters”, “largest”, “noise”) are plotted against cluster parameters (radius cutoff r and cnn cutoff c).

Parameters
  • ax – Matplotlib Axes to plot on. If None, a new Figure with Axes will be created.

  • quant

    Record value to visualise:

    • “time”

    • “clusters”

    • “largest”

    • “noise”

  • treat_nan – If not None, use this value to pad nan-values.

  • ax_props – Used to style ax.

  • contour_props – Passed on to contour.

to_DataFrame()

Convert list of records to (typed) pandas DataFrame

Returns

TypedDataFrame


class cnnclustering.cnn.Labels(sequence: Optional[Sequence[int]] = None, info=None, consider=None)

Cluster label assignments

Inherits from numpy.ndarray.

Parameters
  • sequence – Any 1D sequence that can be converted to a NumPy array of integers representing cluster label assignments. If None, will create an empty instance.

  • info – Instance of LabelInfo metadata.

  • consider – Any 1D sequence matching the length of sequence of 0 and 1 used to set consider.

info

Instance of LabelInfo metadata.

consider

Array of 0 and 1 of same length as labels, indicating which labels should be still considered (e.g. for predictions).

static dict2labels(dictionary: Dict[int, Collection[int]]) → Type[numpy.ndarray]

Convert cluster dictionary to labels

Parameters

dictionary – Dictionary of point indices per cluster label to convert

Returns

Sequence of labels for each point as NumPy ndarray

fix_missing()

Fix missing cluster labels and ensure continuous numbering

If you also want the labels to be sorted by cluster size, use sort_by_size() instead, which re-numbers clusters, too.

static labels2dict(labels: Collection[int]) → Dict[int, Set[int]]

Convert labels to cluster dictionary

Parameters

labels – Sequence of integer cluster labels to convert

Returns

Dictionary of sets of point indices with cluster labels as keys
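The two conversions between a label sequence and a cluster dictionary can be sketched in plain Python as below. The functions `labels_to_dict` and `dict_to_labels` are hypothetical stand-ins that return built-in types rather than NumPy arrays, and the reverse conversion assumes that every point index from 0 to n - 1 appears in exactly one cluster.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def labels_to_dict(labels: Sequence[int]) -> Dict[int, Set[int]]:
    """Group point indices by their cluster label."""
    clusters = defaultdict(set)
    for index, label in enumerate(labels):
        clusters[label].add(index)
    return dict(clusters)


def dict_to_labels(dictionary: Dict[int, Set[int]]) -> List[int]:
    """Rebuild the label sequence from point indices per cluster.

    Assumes each index 0..n-1 occurs in exactly one cluster.
    """
    n_points = sum(len(members) for members in dictionary.values())
    labels = [0] * n_points
    for label, members in dictionary.items():
        for index in members:
            labels[index] = label
    return labels


labels = [1, 1, 0, 2, 1, 0]
clusters = labels_to_dict(labels)
round_trip = dict_to_labels(clusters)
```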

merge(clusters: List[int]) → None

Merge a list of clusters into one

sort_by_size(member_cutoff=None, max_clusters=None)

Sort labels by cluster size in-place

Re-assigns cluster numbers so that the biggest cluster (that is not noise) is cluster 1. Also filters out clusters that have fewer than member_cutoff members. Optionally, keeps only the max_clusters largest clusters. Returns the member count in the largest cluster and the number of points declared as noise.

Parameters
  • member_cutoff – Valid clusters need to have at least this many members.

  • max_clusters – Only keep this many clusters.

Returns

(#member largest, #member noise)
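The renumbering described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the function `sort_labels_by_size` is hypothetical, it returns a new list instead of working in-place, noise is assumed to carry the label 0, and a member_cutoff of None is treated as no filtering.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def sort_labels_by_size(
        labels: Sequence[int],
        member_cutoff: Optional[int] = None,
        max_clusters: Optional[int] = None) -> Tuple[List[int], int, int]:
    """Renumber clusters by decreasing size; dropped clusters become noise (0)."""
    if member_cutoff is None:
        member_cutoff = 1  # assumption: no filtering by default

    # Count members per cluster, ignoring noise (assumed label 0)
    counts = Counter(label for label in labels if label != 0)

    # Cluster labels ordered from largest to smallest, filtered by size
    ranked = [
        label for label, count in counts.most_common()
        if count >= member_cutoff
    ]
    if max_clusters is not None:
        ranked = ranked[:max_clusters]

    # Largest cluster becomes 1, second largest 2, and so on
    mapping = {old: new for new, old in enumerate(ranked, start=1)}
    new_labels = [mapping.get(label, 0) for label in labels]

    largest = counts[ranked[0]] if ranked else 0
    noise = new_labels.count(0)
    return new_labels, largest, noise


labels = [3, 3, 3, 1, 1, 0, 2, 3]
sorted_labels, n_largest, n_noise = sort_labels_by_size(labels, member_cutoff=2)
```

In this example the former cluster 3 becomes cluster 1, the former cluster 1 becomes cluster 2, and the single-member cluster 2 is merged into noise.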

trash(clusters: List[int]) → None

Merge a list of clusters into noise


class cnnclustering.cnn.LabelInfo(origin: Optional[str], reference: Optional[Type[CNN]], params: Dict)

Context information for labels

LabelInfo instances will be attached to Labels and/or modified by CNN.fit(), CNN.reel(), and CNN.predict().

origin: Optional[str]

Valid identifiers are:

  • “fitted”: Labels were assigned by CNN.fit().

  • “reeled”: Labels were overwritten by CNN.reel().

  • “predicted”: Labels were assigned by CNN.predict().

  • None: Unknown origin.

params: Dict

An overview of the cluster parameters used to assign labels. This is a dictionary with label numbers as keys and parameter tuples (r, c) as values. This is useful if labels have been reeled or predicted and have different underlying parameters (Dict[int, Tuple(float, int)]).

reference: Optional[Type[CNN]]

A CNN instance supporting the origin. If origin = “fitted” or origin = “reeled”, this is a reference to the object carrying the labels. If origin = “predicted”, this is a reference to the object carrying the base labels. Can be None if reference is unknown.


Pandas

class cnnclustering.cnn.TypedDataFrame(columns, dtypes, content=None)

Optional constructor to convert CNNRecords to pandas.DataFrame


Decorators

cnnclustering.cnn.timed(function_)

Decorator to measure execution time.

Forwards the output of the wrapped function and measured execution time as a tuple.
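A decorator with the described behaviour can be sketched as below. The name `timed_sketch` is hypothetical and the use of `time.perf_counter` is an assumption; the module's own decorator may measure time differently.

```python
import functools
import time


def timed_sketch(function_):
    """Return the wrapped function's output together with the elapsed seconds."""
    @functools.wraps(function_)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        output = function_(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return output, elapsed
    return wrapper


@timed_sketch
def add(a, b):
    return a + b


# The decorated function now returns a (output, seconds) tuple
result, seconds = add(1, 2)
```

Callers of a decorated function have to unpack the tuple, which is why a follow-up step that formats the feedback needs to know whether timing was applied.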


cnnclustering.cnn.recorded(function_)

Decorator to format fit function feedback.

Used to decorate fit methods of CNN instances. Feedback needs to be a sequence in record format, i.e. conforming to the CNNRecord namedtuple. If execution time was measured, the corresponding field will be modified.


Functional API

cnnclustering.cnn.calc_dist(data: Any, other: Optional[Any] = None, v: bool = True, method: str = 'cdist', mmap: bool = False, mmap_file: Optional[Union[pathlib.Path, str, IO[bytes]]] = None, chunksize: int = 10000, progress: bool = True, **kwargs) → Type[cnnclustering.cnn.Distances]

High level wrapper function for CNN.calc_dist().

A CNN instance is created with the given data as data points.

Parameters
  • data – Data suitable to be interpreted as Points.

  • other – Second data suitable to be interpreted as Points used for relative distance computation.

Returns

Distance matrix as instance of Distances of shape (n, m) with n points in data and m points in other. If other is None, m = n.


cnnclustering.cnn.fit(data: Any, radius_cutoff: Optional[float] = None, cnn_cutoff: Optional[int] = None, member_cutoff: Optional[int] = None, max_clusters: Optional[int] = None, cnn_offset: Optional[int] = None, info: bool = True, sort_by_size: bool = True, rec: bool = True, v: bool = True, policy: Optional[str] = None) → Type[cnnclustering.cnn.Labels]

High level wrapper function for CNN.fit().

A CNN instance is created with the given data as data points, distances, neighbourhoods or density graph.

Parameters

data – Data as instance of Points, Distances, Neighbourhoods, Densitygraph or any subclass.

Returns

Cluster label assignments as instance of Labels.

Raises

ValueError – If data is not of suitable type.