cnn - A Python module for common-nearest-neighbour (CNN) clustering¶
CNN class API¶
The functionality of this module is primarily exposed and bundled by the
cnnclustering.cnn.CNN
class. For hierarchical clusterings
cnnclustering.cnn.CNNChild
is used, too.
-
class
cnnclustering.cnn.
CNN
(points: Optional[Any] = None, distances: Optional[Any] = None, neighbourhoods: Optional[Any] = None, graph: Optional[Any] = None, labels: Collection[int] = None, alias: str = 'root')¶ CNN cluster object
A cluster object connects input data (points, distances, neighbours, or density graphs) to cluster results (labels) and clustering methodologies (fits). It also interfaces several convenience functions.
- Parameters
points – Argument passed on to Data to construct attribute data.
distances – Argument passed on to Data to construct attribute data.
neighbourhoods – Argument passed on to Data to construct attribute data.
graph – Argument passed on to Data to construct attribute data.
labels – Argument passed on to Labels to construct attribute labels.
alias – Descriptive object identifier.
-
alias
¶ Descriptive object identifier.
-
hierarchy_level
¶ Level in the cluster hierarchy of the cluster object.
-
status
¶ Dictionary summarising the current state of the cluster object.
-
children
¶ Dictionary of child cluster objects (of type CNNChild). Created from cluster label assignments by isolate().
-
calc_dist
(other: Optional[Type[CNN]] = None, v: bool = True, method: str = 'cdist', mmap: bool = False, mmap_file: Optional[Union[pathlib.Path, str, IO[bytes]]] = None, chunksize: int = 10000, progress: bool = True, **kwargs)¶ Compute a distance matrix
Requires Data.points, computes distances and sets Data.distances.
- Parameters
other – If not None, a second CNN cluster object. Distances are calculated between n points in self and m points in other. If None, distances are calculated within self.
v – Be chatty.
method – Method to compute distances with:
cdist: scipy.spatial.distance.cdist.
mmap – Whether to memory map the calculated distances on disk with NumPy.
mmap_file – If mmap is set to True, where to store the file. If None, uses a temporary file.
chunksize – Portions of data to process at once. Can be used to keep memory consumption low. Only useful together with mmap.
progress – Whether to show a progress bar.
**kwargs – Passed on to whatever is used as method.
- Returns
None
- Raises
ValueError – If method not known.
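The chunksize parameter suggests that distances are computed block by block to bound memory use. The following is a minimal pure-Python sketch of that idea, not the module's implementation: `pairwise_distances_chunked` is a hypothetical helper, and Euclidean distance is assumed.

```python
import math
from typing import Iterator, List, Sequence, Tuple

Point = Sequence[float]


def pairwise_distances_chunked(
    points: Sequence[Point],
    other: Sequence[Point],
    chunksize: int = 2,
) -> Iterator[Tuple[int, List[List[float]]]]:
    """Yield (row_offset, block) pairs of Euclidean distances.

    Each block holds the distances from at most `chunksize` points in
    `points` to all points in `other`, so only one block has to be kept
    in memory at a time (it could be written to a memory-mapped file).
    """
    for start in range(0, len(points), chunksize):
        chunk = points[start:start + chunksize]
        block = [[math.dist(p, q) for q in other] for p in chunk]
        yield start, block


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
blocks = list(pairwise_distances_chunked(points, points, chunksize=2))
```

With three points and a chunk size of two, the generator yields two blocks, the first covering rows 0 and 1 and the second covering row 2.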
-
calc_neighbours_from_cKDTree
(r: float, other: Optional[Type[CNN]] = None, format='array_arrays', **kwargs)¶ Calculate neighbourhoods from tree structure
Requires Data.points.tree, computes neighbourhoods using scipy.spatial.cKDTree.query_ball_tree and sets Data.neighbourhoods. See also Points.cKDTree() to build a suitable tree structure from data points.
- Parameters
r – Search query radius.
other – If not None, another CNN instance whose data points should be used for a relative neighbour search. Also requires other.data.points.tree.
format – Output format for the created neighbourhoods:
“list_sets”: List of sets of integer point indices (NeighbourhoodsList).
“array_arrays”: 1D NumPy array of 1D NumPy arrays of integer point indices (NeighbourhoodsArray).
**kwargs – Keyword args passed on to scipy.spatial.cKDTree.query_ball_tree
-
calc_neighbours_from_dist
(r: float, format='array_arrays')¶ Calculate neighbourhoods from distances
Requires Data.distances, computes neighbourhoods and sets Data.neighbourhoods.
- Note: Currently only works for distance objects of type Distances.
- Parameters
r – Search query radius.
format – Output format for the created neighbourhoods:
“list_sets”: List of sets of integer point indices (NeighbourhoodsList).
“array_arrays”: 1D NumPy array of 1D NumPy arrays of integer point indices (NeighbourhoodsArray).
- Returns
None
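As an illustration of the “list_sets” format, the sketch below derives neighbourhoods from a square distance matrix. It is a plain-Python approximation under stated assumptions: `neighbourhoods_from_distances` is a hypothetical name, a point counts as a neighbour when its distance is less than or equal to r (whether the module uses an inclusive bound is not stated), and self-counting is switched off by default.

```python
from typing import List, Sequence, Set


def neighbourhoods_from_distances(
    distances: Sequence[Sequence[float]],
    r: float,
    self_counting: bool = False,
) -> List[Set[int]]:
    """Return one set of neighbour indices per row of a square distance matrix."""
    neighbourhoods = []
    for i, row in enumerate(distances):
        # Assumption: the radius bound is inclusive (d <= r).
        neighbours = {j for j, d in enumerate(row) if d <= r}
        if not self_counting:
            neighbours.discard(i)  # a point is not its own neighbour
        neighbourhoods.append(neighbours)
    return neighbourhoods


# Three points on a line at positions 0, 1 and 3.
distances = [
    [0.0, 1.0, 3.0],
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
neighbourhoods = neighbourhoods_from_distances(distances, r=2.0)
```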
-
check
()¶ Check current data state
Check if data points, distances, neighbourhoods or a density graph are present. The check depends on the length of the stored objects. An empty data structure (length = 0) represents no data. Sets status.
- Returns
None
-
cut
(part: Optional[int] = None, points: Tuple[Optional[int], …] = None, dimensions: Tuple[Optional[int], …] = None)¶ Create a new
CNN
instance from a data subset
Convenience function to create a reduced cluster object. Supported are continuous slices from the original data that allow making a view instead of a copy.
- Parameters
part – Cut out the points for exactly one part (zero based index).
points – Slice points by using (start:stop:step)
dimensions – Slice dimensions by using (start:stop:step)
- Returns
-
dist_hist
(ax: Optional[Type[matplotlib.axes._subplots.SubplotBase]] = None, maxima: bool = False, maxima_props: Optional[Dict[str, Any]] = None, hist_props: Optional[Dict[str, Any]] = None, ax_props: Optional[Dict[str, Any]] = None, inter_props: Optional[Dict[str, Any]] = None)¶ Plot a histogram of distances in the data set
Requires data.distances.
- Parameters
ax – Matplotlib Axes to plot on. If None, Figure and Axes are created.
maxima – Whether to mark the maxima of the distribution. Uses scipy.signal.argrelextrema.
maxima_props – Keyword arguments passed to scipy.signal.argrelextrema if maxima is set to True.
hist_props – Keyword arguments passed to numpy.histogram to compute the histogram.
ax_props – Keyword arguments for Matplotlib Axes styling.
-
evaluate
(ax: Optional[Type[matplotlib.axes._subplots.SubplotBase]] = None, clusters: Optional[Collection[int]] = None, original: bool = False, plot: str = 'dots', points: Optional[Tuple[Optional[int]]] = None, dim: Optional[Tuple[int, int]] = None, ax_props: Optional[Dict] = None, annotate: bool = True, annotate_pos: str = 'mean', annotate_props: Optional[Dict] = None, plot_props: Optional[Dict] = None, plot_noise_props: Optional[Dict] = None, hist_props: Optional[Dict] = None, free_energy: bool = True)¶ Returns a 2D plot of an original data set or a cluster result
- Args: ax: The Axes instance to which to add the plot. If
None, a new Figure with Axes will be created.
- clusters:
Cluster numbers to include in the plot. If None, consider all.
- original:
Plot the original data instead of a cluster result. Overrides clusters. Will be considered True if no cluster result is present.
- plot:
The kind of plotting method to use.
“dots”,
ax.plot()
“scatter”,
ax.scatter()
“contour”,
ax.contour()
“contourf”,
ax.contourf()
- parts:
Use a slice (start, stop, stride) on the data parts before plotting.
- points:
Use a slice (start, stop, stride) on the data points before plotting.
- dim:
Use these two dimensions for plotting. If None, uses (0, 1).
- annotate:
If there is a cluster result, plot the cluster numbers. Uses annotate_pos to determine the position of the annotations.
- annotate_pos:
Where to put the cluster number annotation. Can be one of:
“mean”, Use the cluster mean
“random”, Use a random point of the cluster
Alternatively a list of x, y positions can be passed to set a specific point for each cluster (Not yet implemented)
- annotate_props:
Dictionary of keyword arguments passed to ax.annotate().
- ax_props:
Dictionary of ax properties to apply after plotting via ax.set(**ax_props). If None, uses defaults that can be also defined in the configuration file (Not yet implemented).
- plot_props:
Dictionary of keyword arguments passed to various functions (_plots.plot_dots() etc.) with different meaning to format cluster plotting. If None, uses defaults that can be also defined in the configuration file (Not yet implemented).
- plot_noise_props:
Like plot_props but for formatting noise point plotting.
- hist_props:
Dictionary of keyword arguments passed to functions that involve the computing of a histogram via numpy.histogram2d.
- free_energy:
If True, converts computed histograms to pseudo free energy surfaces.
- mask:
Sequence of boolean or integer values used for optional fancy indexing on the point data array. Note, that this is applied after regular slicing (e.g. via points) and requires a copy of the indexed data (may be slow and memory intensive for big data sets).
- Returns
Figure, Axes and a list of plotted elements
-
fit
(*args, **kwargs) → Optional[Tuple[cnnclustering.cnn.CNNRecord, bool]]¶ Wraps CNN clustering execution
Requires one of Data.graph, Data.neighbourhoods, Data.distances, or Data.points and sets labels.
This function prepares the clustering and calls an appropriate worker function to do the actual clustering. How the clustering is done depends on the current data situation and the selected policy. The clustering can be done with different inputs (points, distances, neighbourhoods, or a density graph).
The differing input structures have varying memory demands. In particular, storage of distances can be costly memory-wise (memory complexity \(\mathcal{O}(n^2)\)). Ultimately, the clustering depends on neighbourhood or density graph information. The clustering is fast if neighbourhoods or a density graph are pre-computed, but this has to be re-done for each radius_cutoff and/or cnn_cutoff separately. Neighbourhoods can be calculated either from data points (e.g. calc_neighbours_from_cKDTree()) or from pre-computed pairwise distances (e.g. calc_neighbours_from_dist()). The user is encouraged to apply any external method of choice to provide neighbourhoods in a format that can be clustered. If a primary input structure (points or distances) is given, neighbourhoods will be computed. If the user chooses policy = “progressive”, neighbourhoods will be computed automatically before the clustering from either distances (if present) or points. If the user chooses policy = “conservative”, neighbourhoods will be computed on-the-fly (online) from either distances (if present) or points during the clustering. This can save memory but can be computationally more expensive. Caching can be used to achieve the right balance between memory usage and computing effort for your situation.
- Parameters
radius_cutoff – Radius cutoff cluster parameter.
cnn_cutoff – CNN cutoff cluster parameter (similarity criterion).
member_cutoff – Valid clusters need to have at least this many members. Passed to Labels.sort_by_size() if sort_by_size is True. Has no effect otherwise and valid clusters have at least two members.
max_clusters – Keep only the largest max_clusters clusters. Passed to Labels.sort_by_size() if sort_by_size is True. Has no effect otherwise.
cnn_offset – Exists for compatibility reasons and is subtracted from cnn_cutoff. If cnn_offset = 0, two points need to share at least cnn_cutoff neighbours to be part of the same cluster without counting any of the two points. In former versions of the clustering, self-counting was included and cnn_cutoff = 2 was equivalent to cnn_cutoff = 0 in this version.
info – Whether to attach LabelInfo to created Labels instance.
sort_by_size – Whether to sort (and trim) the created Labels instance. See also Labels.sort_by_size().
rec – Whether to create and return a CNNRecord instance in the end.
v – Be chatty.
policy –
Determines the computation behaviour depending on the given data situation:
- ”progressive”: Bulk-compute neighbourhoods
if needed.
- ”conservative”: Compute neighbourhoods on-the-fly
if needed.
- Returns
Tuple(CNNRecord, v) if rec is True, None otherwise
- Raises
AssertionError – If policy is not in [“progressive”, “conservative”]
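The interplay of cnn_cutoff and cnn_offset described above can be sketched as a small predicate on pre-computed neighbourhoods. This is an illustrative reading of the parameter description, not the module's worker function: `share_enough_neighbours` is a hypothetical name, and the sketch only checks the number of shared neighbours (any further condition the clustering may apply, such as the two points being neighbours of each other, is left out).

```python
from typing import List, Set


def share_enough_neighbours(
    neighbourhoods: List[Set[int]],
    a: int,
    b: int,
    cnn_cutoff: int,
    cnn_offset: int = 0,
) -> bool:
    """Check the common-nearest-neighbour similarity criterion for a and b.

    The two points themselves are never counted as shared neighbours,
    and cnn_offset is subtracted from cnn_cutoff as described for fit().
    """
    common = (neighbourhoods[a] & neighbourhoods[b]) - {a, b}
    return len(common) >= cnn_cutoff - cnn_offset


# Points 0 and 1 share the neighbours 2 and 3.
neighbourhoods = [{1, 2, 3}, {0, 2, 3}, {0, 1}, {0, 1}]
similar_c2 = share_enough_neighbours(neighbourhoods, 0, 1, cnn_cutoff=2)
similar_c3 = share_enough_neighbours(neighbourhoods, 0, 1, cnn_cutoff=3)
```

Here the pair passes for cnn_cutoff = 2 but fails for cnn_cutoff = 3, unless cnn_offset = 1 lowers the effective threshold again.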
-
get_dtraj
()¶ Transform cluster labels to discrete trajectories
- Returns
[description]
- Return type
[type]
-
get_samples
(kind: str = 'mean', clusters: Optional[List[int]] = None, n_samples: int = 1, by_parts: bool = True, skip: int = 0, stride: int = 1) → Dict[int, List[int]]¶ Select sample points from clusters
- Parameters
kind –
How to choose the samples:
”mean”:
”random”:
”all”:
clusters – List of cluster numbers to consider
n_samples – How many samples to return
by_parts – Return point indices as list of lists by parts
skip – Skip the first n frames
stride – Take only every n th frame
- Returns
Dictionary of sample point indices as list for each cluster
-
isolate
(purge: bool = True) → None¶ Isolates points per clusters based on a cluster result
-
predict
(**kwargs) → None¶ Wraps CNN cluster prediction execution
Predict labels for points in a data set (other) on the basis of assigned labels to a “train” set (self).
- Parameters
other – CNN cluster object for whose points cluster labels should be predicted.
radius_cutoff – Find nearest neighbours within distance r.
cnn_cutoff – Points of the same cluster must have at least c common nearest neighbours (similarity criterion).
include_all – If False, keep cluster assignment for points in the test set that have a maximum distance of same_tol to a point in the train set, i.e. they are essentially the same point (currently not implemented).
same_tol – Distance cutoff to treat points as the same, if include_all is False.
clusters – Predict assignment of points only with respect to this list of clusters.
purge – If True, reinitialise predicted labels. Override assignment memory.
cnn_offset – Mainly for backwards compatibility. Modifies the cnn_cutoff.
policy –
Determines the computation behaviour depending on the given data situation:
- ”progressive”: Bulk-compute neighbourhoods
if needed.
- ”conservative”: Compute neighbourhoods on-the-fly
if needed.
Returns – None
-
reel
(deep: Optional[int] = 1) → None¶ Wrap up assignments of lower hierarchy levels
- Parameters
deep – How many lower levels to consider. If None, consider all.
-
class
cnnclustering.cnn.
CNNChild
(parent, *args, alias='child', **kwargs)¶ CNN cluster object subclass.
Increments the hierarchy level of the parent object when instantiated.
-
parent
¶ Reference to parent
-
Data input formats¶
Input data of differing nature and format (points, distances,
neighbourhoods,
density graphs) are
organised and bundled by the cnnclustering.cnn.Data
class.
-
class
cnnclustering.cnn.
Data
(points=None, distances=None, neighbourhoods=None, graph=None)¶ Abstraction class for handling input data
A data object bundles points, distances, neighbourhoods and density graphs.
- Parameters
points – Used to set points. If None, creates an empty instance of Points. If an instance of Points is passed, uses this instance. If anything else is passed, tries to create an instance of Points using Points.from_parts().
distances – Used to set distances. If None, creates an empty instance of Distances. If an instance of Distances is passed, uses this instance. If anything else is passed, tries to create an instance of Distances using the default constructor.
neighbourhoods – Used to set neighbourhoods. If None, creates an empty instance of NeighbourhoodsArray. If an instance of a class qualifying as neighbourhoods object (see NeighbourhoodsABC) is passed, uses this instance. If anything else is passed, tries to create an instance of NeighbourhoodsArray using the default constructor.
graph – Used to set graph. If None, creates an empty instance of DensitySparsegraphArray. If an instance of a class qualifying as density graph object (see DensitygraphABC) is passed, uses this instance. If anything else is passed, tries to create an instance of DensitySparsegraphArray using the default constructor.
-
shape
¶ Dictionary summarising size of data structures.
-
neighbourhoods
¶ Instance of subclass of NeighbourhoodsABC.
-
graph
¶ Instance of subclass of DensitygraphABC.
Points¶
Points are currently supported in the form of a 2D NumPy array of shape
(n: points, d: dimensions) through the
cnnclustering.cnn.Points
class.
-
class
cnnclustering.cnn.
Points
(p: Optional[numpy.ndarray] = None, edges: Optional[Sequence] = None, tree: Optional[Any] = None)¶ Abstraction class for data points
-
by_parts
() → Iterator¶ Yield data by parts
- Returns
Generator of 2D numpy.ndarrays (parts)
-
cKDTree
(**kwargs)¶ Wrapper for
scipy.spatial.cKDTree()
Sets Points.tree.
- Parameters
**kwargs – Passed to scipy.spatial.cKDTree()
-
classmethod
from_file
(f: Union[str, pathlib.Path], *args, from_parts: bool = False, **kwargs)¶ Alternative constructor
Load file content to be interpreted as points. Uses load() to read files.
- Recognised input formats are:
Sequence
2D Sequence (sequence of sequences all of same length)
Sequence of 2D sequences all of same second dimension
- Parameters
f – File name as string or pathlib.Path object.
*args – Arguments passed to load().
from_parts – If True uses from_parts() constructor. If False uses default constructor.
- Returns
Instance of
Points
-
classmethod
from_parts
(p: Optional[Sequence])¶ Alternative constructor
Use if data is passed as a collection of parts, as
>>> p = Points.from_parts([[[0, 0], [1, 1]],
...                        [[2, 2], [3, 3]]])
>>> p
Points([[0, 0], [1, 1], [2, 2], [3, 3]])
- Recognised input formats are:
Sequence
2D Sequence (sequence of sequences all of same length)
Sequence of 2D sequences all of same second dimension
In this way, part edges are taken from the input shape and do not have to be specified explicitly. Calls get_shape().
- Parameters
p – Data passed as a collection of parts (see the recognised input formats above).
- Returns
Instance of Points
-
static
get_shape
(data: Any)¶ Maintain data in universal shape (2D NumPy array)
Analyses the format of given data and fits it into the standard format (parts, points, dimensions). Creates a numpy.ndarray vstacked along the parts component that can be passed to the Points constructor alongside part edges. This may not be able to deal with all possible kinds of input structures correctly, so check the outcome carefully.
- Recognised input formats are:
Sequence
2D Sequence (sequence of sequences all of same length)
Sequence of 2D sequences all of same second dimension
- Parameters
data –
Either None or:
a 1D sequence of length d, interpreted as 1 point in d dimensions
a 2D sequence of length n (rows) and width d (columns), interpreted as n points in d dimensions
a list of 2D sequences, interpreted as groups (parts) of points
- Returns
Tuple of
NumPy array of shape (\(\sum n, d\))
Part edges list, marking the end points of the parts
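The three recognised input formats can be normalised by looking at the nesting depth of the input. The sketch below uses plain lists instead of NumPy arrays to stay self-contained, so it is only an approximation of the behaviour described above. `get_shape_sketch` and `_depth` are hypothetical names, and storing the part edges as part lengths is an assumption (cumulative end indices would serve equally well).

```python
from typing import Any, List, Tuple


def _depth(data: Any) -> int:
    """Nesting depth of a non-empty nested list/tuple; scalars have depth 0."""
    depth = 0
    while isinstance(data, (list, tuple)) and len(data) > 0:
        data = data[0]
        depth += 1
    return depth


def get_shape_sketch(data: Any) -> Tuple[List[List[float]], List[int]]:
    """Return (points, edges): a flat 2D list of points and the part lengths."""
    if data is None:
        return [], []
    depth = _depth(data)
    if depth == 1:      # one point in d dimensions
        parts = [[list(data)]]
    elif depth == 2:    # n points in d dimensions, a single part
        parts = [[list(point) for point in data]]
    elif depth == 3:    # several parts of points
        parts = [[list(point) for point in part] for part in data]
    else:
        raise ValueError("Unsupported input structure")
    points = [point for part in parts for point in part]  # stack the parts
    edges = [len(part) for part in parts]  # assumption: edges as part lengths
    return points, edges


points, edges = get_shape_sketch([[[0, 0], [1, 1]], [[2, 2]]])
```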
-
static
load
(f: Union[pathlib.Path, str], *args, **kwargs) → None¶ Loads file content
Depending on the filename extension, a suitable loader is called:
.p:
pickle.load()
.npy:
numpy.load()
None:
numpy.loadtxt()
.xvg, .dat:
numpy.loadtxt()
Sets data and shape.
- Parameters
f – File
*args – Passed to loader
- Keyword Arguments
**kwargs – Passed to loader
- Returns
Return value of the loader
-
Distances¶
Distances are currently supported in the form of a 2D NumPy array of shape
(n: points, m: points) through the
cnnclustering.cnn.Distances
class.
-
class
cnnclustering.cnn.
Distances
(d: Optional[numpy.ndarray] = None, reference=None)¶ Abstraction class for data point distances
Neighbourhoods¶
Neighbourhoods are currently supported in the form of:
1D NumPy array of 1D NumPy arrays (cnnclustering.cnn.NeighbourhoodsArray)
Python list of Python sets (cnnclustering.cnn.NeighbourhoodsList)
Valid neighbourhood containers should inherit from the abstract base
class cnnclustering.cnn.NeighbourhoodsABC or the general
realisation cnnclustering.cnn.Neighbourhoods.
-
class
cnnclustering.cnn.
NeighbourhoodsArray
(sequence: Optional[Sequence[int]] = None, radius=None, reference=None, self_counting=False)¶ Array of array realisation of neighbourhood abstraction
Inherits from numpy.ndarray and Neighbourhoods.
-
property
n_neighbours
¶ Return number of neighbours for each point
-
property
-
class
cnnclustering.cnn.
NeighbourhoodsList
(neighbourhoods=None, radius=None, reference=None)¶ List of sets realisation of neighbourhood abstraction
Inherits from collections.UserList and Neighbourhoods.
-
property
n_neighbours
¶ Return number of neighbours for each point
-
property
-
class
cnnclustering.cnn.
Neighbourhoods
(neighbourhoods=None, radius=None, reference=None, self_counting=False)¶ Basic realisation of neighbourhood abstraction
Inherits from NeighbourhoodsABC.
Makes no assumptions on the nature of the stored neighbours and provides default implementations for the attributes required by NeighbourhoodsABC. Since working realisations of the Neighbourhoods base class usually inherit with priority from a collection type whose __init__ mechanism is probably used, the alternative method init_finalise is offered to set the required attributes.
-
property
n_neighbours
¶ Return number of neighbours for each point
-
property
radius
¶ Return radius
-
property
reference
¶ Return reference CNN instance
-
property
self_counting
¶ Self-counting of points as their own neighbours?
-
property
-
class
cnnclustering.cnn.
NeighbourhoodsABC
¶ Abstraction class for neighbourhoods
Neighbourhoods (integer point indices) can be stored in different data structures (non-exhaustive listing):
Collection of collections:
list of sets (NeighbourhoodsList)
array of arrays (NeighbourhoodsArray)
Linear collection plus slice indicator:
array of neighbours plus array of starting indices (NeighbourhoodsSparsegraphArray)
array in which one element indicates the length and the following elements are neighbours (NeighbourhoodsLinear)
To qualify as a neighbourhoods container, the following attributes should be present in any case:
- radius: Points are neighbours of each other with respect to
this radius (any metric).
- reference: A
CNN
instance if neighbourhoods are valid for points in different data sets.
- n_neighbours: Return the neighbour count for each point in
the container.
- self_counting: Boolean indicator, True if neighbourhoods include
self-counting (point is its own neighbour).
- __str__: A useful str-representation revealing the type and
the radius.
-
abstract property
n_neighbours
¶ Return number of neighbours for each point
-
abstract property
radius
¶ Return radius
-
abstract property
reference
¶ Return reference CNN instance
-
abstract property
self_counting
¶ Self-counting of points as their own neighbours?
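To make the list of required attributes concrete, here is a minimal sketch of an abstract interface with the four properties and a list-of-sets realisation built on collections.UserList. The class names `NeighbourhoodsSketchABC` and `ListOfSets` are hypothetical stand-ins, and the string format in `__str__` is an assumption; the module's own classes may differ in detail.

```python
from abc import ABC, abstractmethod
from collections import UserList


class NeighbourhoodsSketchABC(ABC):
    """Hypothetical stand-in for the abstract neighbourhoods interface."""

    @property
    @abstractmethod
    def radius(self):
        """Radius with respect to which points are neighbours."""

    @property
    @abstractmethod
    def reference(self):
        """Reference object if neighbours refer to another data set."""

    @property
    @abstractmethod
    def n_neighbours(self):
        """Number of neighbours for each point."""

    @property
    @abstractmethod
    def self_counting(self):
        """True if a point counts as its own neighbour."""


class ListOfSets(UserList, NeighbourhoodsSketchABC):
    """List-of-sets realisation; the collection type comes first in the bases."""

    def __init__(self, neighbourhoods=None, radius=None, reference=None,
                 self_counting=False):
        super().__init__(neighbourhoods)  # UserList stores the data
        self._radius = radius
        self._reference = reference
        self._self_counting = self_counting

    @property
    def radius(self):
        return self._radius

    @property
    def reference(self):
        return self._reference

    @property
    def n_neighbours(self):
        return [len(neighbours) for neighbours in self.data]

    @property
    def self_counting(self):
        return self._self_counting

    def __str__(self):
        # Assumed format: type name plus radius.
        return f"{type(self).__name__}, radius = {self.radius}"


nb = ListOfSets([{1}, {0, 2}, {1}], radius=2.0)
```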
Density graphs¶
Density graphs are currently supported in the form of sparse graphs
through the cnnclustering.cnn.SparsegraphArray
class.
Valid density graph containers should inherit from the abstract base
class cnnclustering.cnn.DensitygraphABC or the general
realisation cnnclustering.cnn.Densitygraph.
-
class
cnnclustering.cnn.
DensitySparsegraphArray
¶
-
class
cnnclustering.cnn.
Densitygraph
¶
-
class
cnnclustering.cnn.
DensitygraphABC
¶
Cluster results¶
Cluster results can be recorded by cnnclustering.cnn.CNN.fit()
as cnnclustering.cnn.CNNRecord. Multiple records are collected by
cnnclustering.cnn.Summary. Cluster label assignments are stored
independently and are currently supported through
cnnclustering.cnn.Labels. Labels can be put into context by
attaching cnnclustering.cnn.LabelInfo.
-
class
cnnclustering.cnn.
CNNRecord
(points: int, r: float, c: int, min: int, max: int, clusters: int, largest: float, noise: float, time: float)¶ Cluster result container
CNNRecord instances can be returned by CNN.fit() and are collected in Summary.
-
c
: int¶ CNN cutoff c (similarity criterion)
-
clusters
: int¶ Number of identified clusters.
-
largest
: float¶ Ratio of points in the largest cluster.
-
max
: int¶ Maximum cluster number. After sorting, only the biggest max clusters are kept.
-
min
: int¶ Member cutoff. Valid clusters have at least this many members.
-
noise
: float¶ Ratio of points classified as outliers.
-
points
: int¶ Number of points in the clustered data set.
-
r
: float¶ Radius cutoff r.
-
time
: float¶ Measured execution time for the fit, including sorting in seconds.
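The record fields listed above can be mirrored in a small named tuple and filled from a label sequence. This is a sketch under assumptions rather than the module's code: `RecordSketch` and `summarise_labels` are hypothetical names, and noise points are assumed to carry the label 0.

```python
from collections import Counter
from typing import NamedTuple, Optional, Sequence


class RecordSketch(NamedTuple):
    """Hypothetical mirror of the record fields documented above."""
    points: int
    r: float
    c: int
    min: Optional[int]
    max: Optional[int]
    clusters: int
    largest: float
    noise: float
    time: float


def summarise_labels(
    labels: Sequence[int],
    r: float,
    c: int,
    member_cutoff: Optional[int],
    max_clusters: Optional[int],
    elapsed: float,
    noise_label: int = 0,  # assumption: noise points are labelled 0
) -> RecordSketch:
    """Build a record from cluster labels and the parameters used."""
    n = len(labels)
    sizes = Counter(label for label in labels if label != noise_label)
    n_largest = max(sizes.values(), default=0)
    n_noise = n - sum(sizes.values())
    return RecordSketch(
        points=n, r=r, c=c, min=member_cutoff, max=max_clusters,
        clusters=len(sizes),
        largest=n_largest / n if n else 0.0,  # ratio, not a count
        noise=n_noise / n if n else 0.0,
        time=elapsed,
    )


record = summarise_labels(
    [1, 1, 1, 2, 2, 0, 0, 0], r=0.5, c=2,
    member_cutoff=2, max_clusters=None, elapsed=0.01,
)
```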
-
-
class
cnnclustering.cnn.
Summary
(iterable=None)¶ List like container for cluster results
Stores instances of CNNRecord.
-
insert
(index, item)¶ S.insert(index, value) – insert value before index
-
summarize
(ax: Optional[Type[matplotlib.axes._subplots.SubplotBase]] = None, quant: str = 'time', treat_nan: Optional[Any] = None, ax_props: Optional[Dict] = None, contour_props: Optional[Dict] = None)¶ Generate a 2D plot of record values
Record values (“time”, “clusters”, “largest”, “noise”) are plotted against cluster parameters (radius cutoff r and cnn cutoff c).
- Parameters
ax – Matplotlib Axes to plot on. If None, a new Figure with Axes will be created.
quant –
Record value to visualise:
”time”
”clusters”
”largest”
”noise”
treat_nan – If not None, use this value to pad nan-values.
ax_props – Used to style ax.
contour_props – Passed on to contour.
-
to_DataFrame
()¶ Convert list of records to (typed) pandas DataFrame
- Returns
-
-
class
cnnclustering.cnn.
Labels
(sequence: Optional[Sequence[int]] = None, info=None, consider=None)¶ Cluster label assignments
Inherits from numpy.ndarray.
- Parameters
-
consider
¶ Array of 0 and 1 of same length as labels, indicating which labels should be still considered (e.g. for predictions).
-
static
dict2labels
(dictionary: Dict[int, Collection[int]]) → Type[numpy.ndarray]¶ Convert cluster dictionary to labels
- Parameters
dictionary – Dictionary of point indices per cluster label to convert
- Returns
Sequence of labels for each point as NumPy ndarray
-
fix_missing
()¶ Fix missing cluster labels and ensure continuous numbering
If you also want the labels to be sorted by cluster size, use
sort_by_size()
instead, which re-numbers clusters, too.
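Ensuring continuous numbering amounts to mapping the labels that are present onto 1, 2, 3, ... in their existing order. A minimal sketch on a plain list, assuming that 0 denotes noise and stays untouched (`fix_missing_sketch` is a hypothetical name, and unlike the method it returns a new list instead of modifying the labels in place):

```python
from typing import List, Sequence


def fix_missing_sketch(labels: Sequence[int]) -> List[int]:
    """Renumber cluster labels so that no label between 1 and the maximum is unused."""
    present = sorted({label for label in labels if label != 0})
    mapping = {old: new for new, old in enumerate(present, start=1)}
    mapping[0] = 0  # assumption: 0 marks noise and is kept as is
    return [mapping[label] for label in labels]


fixed = fix_missing_sketch([0, 1, 1, 4, 4, 7])
```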
-
static
labels2dict
(labels: Collection[int]) → Dict[int, Set[int]]¶ Convert labels to cluster dictionary
- Parameters
labels – Sequence of integer cluster labels to convert
- Returns
Dictionary of sets of point indices with cluster labels as keys
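The two conversions are inverse views of the same assignment. A plain-Python sketch of both directions follows; the `_sketch` function names are hypothetical, and the reverse conversion assumes that the dictionary covers every point index from 0 to n - 1 exactly once.

```python
from collections import defaultdict
from typing import Collection, Dict, List, Set


def labels2dict_sketch(labels: Collection[int]) -> Dict[int, Set[int]]:
    """Group point indices by their cluster label."""
    dictionary = defaultdict(set)
    for index, label in enumerate(labels):
        dictionary[label].add(index)
    return dict(dictionary)


def dict2labels_sketch(dictionary: Dict[int, Collection[int]]) -> List[int]:
    """Rebuild the per-point label sequence from a cluster dictionary."""
    n_points = sum(len(indices) for indices in dictionary.values())
    labels = [0] * n_points
    for label, indices in dictionary.items():
        for index in indices:
            labels[index] = label
    return labels


clusters = labels2dict_sketch([1, 1, 2, 0, 2])
roundtrip = dict2labels_sketch(clusters)
```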
-
merge
(clusters: List[int]) → None¶ Merge a list of clusters into one
-
sort_by_size
(member_cutoff=None, max_clusters=None)¶ Sort labels by cluster size in-place
Re-assigns cluster numbers so that the biggest cluster (that is not noise) is cluster 1. Also filters out clusters that have fewer than member_cutoff members. Optionally, keeps only the max_clusters largest clusters. Returns the member count in the largest cluster and the number of points declared as noise.
- Parameters
member_cutoff – Valid clusters need to have at least this many members.
max_clusters – Only keep this many clusters.
- Returns
(#member largest, #member noise)
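The re-numbering described above can be sketched on a plain Python list. This is an illustration, not the method itself: `sort_by_size_sketch` is a hypothetical name, noise is assumed to be label 0, filtered clusters are assumed to become noise, and ties in size are broken by the original label, which the documentation does not specify.

```python
from collections import Counter
from typing import List, Optional, Tuple


def sort_by_size_sketch(
    labels: List[int],
    member_cutoff: Optional[int] = None,
    max_clusters: Optional[int] = None,
) -> Tuple[int, int]:
    """Renumber labels in place by descending cluster size.

    Returns (members in largest cluster, number of noise points).
    """
    sizes = Counter(label for label in labels if label != 0)
    # Largest first; ties broken by original label (assumption).
    ranked = [label for label, _ in
              sorted(sizes.items(), key=lambda item: (-item[1], item[0]))]
    if member_cutoff is not None:
        ranked = [label for label in ranked if sizes[label] >= member_cutoff]
    if max_clusters is not None:
        ranked = ranked[:max_clusters]
    mapping = {old: new for new, old in enumerate(ranked, start=1)}
    for index, label in enumerate(labels):
        labels[index] = mapping.get(label, 0)  # dropped clusters become noise
    n_largest = sizes[ranked[0]] if ranked else 0
    return n_largest, labels.count(0)


labels = [1, 2, 2, 2, 3, 3, 0]
n_largest, n_noise = sort_by_size_sketch(labels, member_cutoff=2)
```

In this example the former cluster 2 (three members) becomes cluster 1, cluster 3 becomes cluster 2, and the single-member cluster 1 is turned into noise.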
-
trash
(clusters: List[int]) → None¶ Merge a list of clusters into noise
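Both merge() and trash() re-label the members of the listed clusters. The sketch below shows the idea on plain lists and returns new lists rather than working in place. The function names are hypothetical, and two details are assumptions: merged clusters receive the smallest label of the list, and noise is label 0.

```python
from typing import List, Sequence


def merge_sketch(labels: Sequence[int], clusters: Sequence[int]) -> List[int]:
    """Give all members of the listed clusters one common label."""
    target = min(clusters)  # assumption: keep the smallest label
    selected = set(clusters)
    return [target if label in selected else label for label in labels]


def trash_sketch(labels: Sequence[int], clusters: Sequence[int]) -> List[int]:
    """Declare all members of the listed clusters as noise (label 0)."""
    selected = set(clusters)
    return [0 if label in selected else label for label in labels]


labels = [1, 2, 3, 3, 0, 2]
merged = merge_sketch(labels, [2, 3])
trashed = trash_sketch(labels, [1, 3])
```

Either operation can leave gaps in the numbering, which is where a renumbering step such as fix_missing() becomes useful.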
-
class
cnnclustering.cnn.
LabelInfo
(origin: Optional[str], reference: Optional[Type[CNN]], params: Dict)¶ Context information for labels
LabelInfo instances will be attached to Labels and/or modified by CNN.fit(), CNN.reel(), and CNN.predict().
-
origin
: Optional[str]¶ Valid identifiers are:
“fitted”: Labels were assigned by CNN.fit().
“reeled”: Labels were overwritten by CNN.reel().
“predicted”: Labels were assigned by CNN.predict().
None: Unknown origin.
-
params
: Dict¶ An overview of cluster parameters used to assign labels. This is a dictionary with label numbers as keys and parameter tuples (r, c) as values. This is useful if labels have been reeled or predicted and have different underlying parameters (Dict[int, Tuple(float, int)])
-
Pandas¶
-
class
cnnclustering.cnn.
TypedDataFrame
(columns, dtypes, content=None)¶ Optional constructor to convert CNNRecords to pandas.DataFrame
Decorators¶
-
cnnclustering.cnn.
timed
(function_)¶ Decorator to measure execution time.
Forwards the output of the wrapped function and measured execution time as a tuple.
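A decorator with this behaviour can be written in a few lines. The following is a generic sketch, not the module's source: `timed_sketch` is a hypothetical name, and time.perf_counter() is an assumed choice of clock.

```python
import functools
import time


def timed_sketch(function_):
    """Return the wrapped function's output together with its run time."""
    @functools.wraps(function_)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = function_(*args, **kwargs)
        elapsed = time.perf_counter() - start
        return result, elapsed  # output and execution time as a tuple
    return wrapper


@timed_sketch
def add(a, b):
    return a + b


result, elapsed = add(1, 2)
```

Callers of a decorated function have to unpack the tuple, which is why a second decorator that post-processes the feedback (as described for recorded below) fits naturally on top.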
-
cnnclustering.cnn.
recorded
(function_)¶ Decorator to format fit function feedback.
Used to decorate fit methods of CNN instances. Feedback needs to be a sequence in record format, i.e. conforming to the CNNRecord namedtuple. If execution time was measured, the corresponding field will be modified.
Functional API¶
-
cnnclustering.cnn.
calc_dist
(data: Any, other: Optional[Any] = None, v: bool = True, method: str = 'cdist', mmap: bool = False, mmap_file: Optional[Union[pathlib.Path, str, IO[bytes]]] = None, chunksize: int = 10000, progress: bool = True, **kwargs) → Type[cnnclustering.cnn.Distances]¶ High level wrapper function for
CNN.calc_dist().
A CNN instance is created with the given data as data points.
-
cnnclustering.cnn.
fit
(data: Any, radius_cutoff: Optional[float] = None, cnn_cutoff: Optional[int] = None, member_cutoff: Optional[int] = None, max_clusters: Optional[int] = None, cnn_offset: Optional[int] = None, info: bool = True, sort_by_size: bool = True, rec: bool = True, v: bool = True, policy: Optional[str] = None) → Type[cnnclustering.cnn.Labels]¶ High level wrapper function for
CNN.fit().
A CNN instance is created with the given data as data points, distances, neighbourhoods or density graph.
- Parameters
data – Data as instance of Points, Distances, Neighbourhoods, Densitygraph or any subclass.
- Returns
Cluster label assignments as instance of Labels.
- Raises
ValueError – If data is not of suitable type.