skhubness.analysis.Hubness

class skhubness.analysis.Hubness(k: int = 10, hub_size: float = 2.0, metric='euclidean', k_neighbors: bool = False, k_occurrence: bool = False, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True, **kwargs)

Hubness characteristics of data set.

Parameters
k : int

Neighborhood size.

hub_size : float

Hubs are defined as objects with k-occurrence > hub_size * k.

metric : string, one of ['euclidean', 'cosine', 'precomputed']

Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.

k_neighbors : bool

Whether to save the k-neighbor lists. Requires O(n_test * k) memory.

k_occurrence : bool

Whether to save the k-occurrence. Requires O(n_test) memory.

random_state : int, RandomState instance or None, optional

CURRENTLY IGNORED. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.

shuffle_equal : bool, optional

If true and metric='precomputed', shuffle neighbors with identical distances to avoid artifact hubness. NOTE: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.

n_jobs : int, optional

CURRENTLY IGNORED. Number of processes for parallel computations.

- 1: Don't use multiprocessing.
- -1: Use all CPUs.

verbose : int, optional

Level of output messages.
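To illustrate why shuffling tied neighbors (the shuffle_equal parameter above) matters, here is a minimal NumPy sketch. It is not the library's implementation; the helper name `k_occurrence_with_ties`, its `seed` argument, and the toy distance matrix are assumptions made for this example.

```python
import numpy as np

def k_occurrence_with_ties(dist, k, shuffle_equal, seed=0):
    """Count k-occurrence from a (n_query, n_indexed) distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n_indexed = dist.shape[1]
    rng = np.random.default_rng(seed)
    k_occ = np.zeros(n_indexed, dtype=int)
    for row in dist:
        if shuffle_equal:
            # Visit candidates in random order; the stable sort then keeps
            # equally distant candidates in that random order.
            order = rng.permutation(n_indexed)
            nearest = order[np.argsort(row[order], kind="stable")[:k]]
        else:
            # Ties are always resolved in favor of the lowest index.
            nearest = np.argsort(row, kind="stable")[:k]
        k_occ[nearest] += 1
    return k_occ

# Extreme case: every distance is identical, so all neighbors are tied.
dist = np.ones((50, 50))
plain = k_occurrence_with_ties(dist, k=5, shuffle_equal=False)
shuffled = k_occurrence_with_ties(dist, k=5, shuffle_equal=True)
print(plain.max(), np.count_nonzero(plain))        # 50 5
print(shuffled.max(), np.count_nonzero(shuffled))  # spread over many objects
```

Without shuffling, the first five objects absorb every neighbor slot and look like extreme hubs, although the distances carry no such information.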

References

Ra56b19eecc1a-1

Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531

Ra56b19eecc1a-2

Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (2018).

Attributes
k_skewness_ : float

Hubness, measured as skewness of the k-occurrence histogram [Ra56b19eecc1a-1]

k_skewness_truncnom : float

Hubness, measured as skewness of a truncated normal distribution fitted to the k-occurrence histogram

atkinson_index_ : float

Hubness, measured as the Atkinson index of the k-occurrence distribution

gini_index_ : float

Hubness, measured as the Gini index of the k-occurrence distribution

robinhood_index_ : float

Hubness, measured as the Robin Hood index of the k-occurrence distribution [Ra56b19eecc1a-2]

antihubs_ : int

Indices of antihubs

antihub_occurrence_ : float

Proportion of antihubs in the data set

hubs_ : int

Indices of hubs

hub_occurrence_ : float

Proportion of k-nearest neighbor slots occupied by hubs

groupie_ratio_ : float

Proportion of objects with the largest hub in their neighborhood

k_occurrence_ : ndarray

Reverse neighbor count for each object

k_neighbors_ : ndarray

Indices of the k-nearest neighbors for each object
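As a rough illustration of the skewness-based measure listed above, the following sketch computes the third standardized moment of a k-occurrence array with NumPy. The helper name `k_skewness` and the toy counts are assumptions made for this example; the library's own computation may differ in detail (for instance in bias correction).

```python
import numpy as np

def k_skewness(k_occurrence) -> float:
    """Third standardized moment (population skewness) of k-occurrence."""
    x = np.asarray(k_occurrence, dtype=float)
    centered = x - x.mean()
    std = x.std()  # population standard deviation (ddof=0)
    return float(np.mean(centered ** 3) / std ** 3)

# Toy data: most objects are rarely neighbors, one object is a strong hub.
k_occ = np.array([0, 1, 1, 1, 2, 2, 3, 30])
print(k_skewness(k_occ))                      # positive: right-skewed
print(k_skewness(np.array([1, 2, 3, 4, 5])))  # 0.0 for a symmetric distribution
```

A long right tail (a few objects with very high k-occurrence) yields a large positive value, which is why skewness serves as a hubness indicator.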

__init__(self, k: int = 10, hub_size: float = 2.0, metric='euclidean', k_neighbors: bool = False, k_occurrence: bool = False, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True, **kwargs)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__(self, k, hub_size[, metric, …])

Initialize self.

antihub_occurrence(k_occurrence)

Proportion of antihubs in data set.

atkinson_index(k_occurrence, eps)

Hubness measure; Atkinson index.

estimate(self, X, Y, has_self_distances)

Estimate hubness in a data set.

fit_transform(self, X[, Y, has_self_distances])

gini_index(k_occurrence[, limiting])

Hubness measure; Gini index

hub_occurrence(k, k_occurrence, n_test, hub_size)

Proportion of nearest neighbor slots occupied by hubs.

robinhood_index(k_occurrence)

Hubness measure; Robin Hood/Hoover/Schutz index.

skewness_truncnorm(k_occurrence)

Hubness measure; corrected for non-negativity of k-occurrence.

static antihub_occurrence(k_occurrence: numpy.ndarray) → (numpy.ndarray, float)

Proportion of antihubs in data set.

Antihubs are objects that are never among the nearest neighbors of other objects.

Parameters
k_occurrence : ndarray

Reverse nearest neighbor count for each object.
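The antihub definition above can be sketched in a few lines of NumPy. This is an illustrative reimplementation under the stated definition, not the library's code; the name `antihub_sketch` and the toy counts are assumptions.

```python
import numpy as np

def antihub_sketch(k_occurrence):
    """Return (indices of antihubs, proportion of antihubs)."""
    k_occurrence = np.asarray(k_occurrence)
    # Antihubs never appear in any neighbor list: their k-occurrence is zero.
    antihubs = np.flatnonzero(k_occurrence == 0)
    return antihubs, antihubs.size / k_occurrence.size

k_occ = np.array([0, 3, 0, 5, 2, 0, 1, 9])
antihubs, proportion = antihub_sketch(k_occ)
print(antihubs)    # [0 2 5]
print(proportion)  # 0.375
```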

static atkinson_index(k_occurrence: numpy.ndarray, eps: float = 0.5) → float

Hubness measure; Atkinson index.

Parameters
k_occurrence : ndarray

Reverse nearest neighbor count for each object.

eps : float, default = 0.5

'Income' weight. Turns the index into a normative measure.
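For orientation, the textbook Atkinson index can be written as one minus the ratio of a generalized mean to the arithmetic mean. The sketch below applies that formula to k-occurrence counts; it is an assumption that this matches the library's computation, the name `atkinson_sketch` is hypothetical, and only the case eps < 1 is covered.

```python
import numpy as np

def atkinson_sketch(k_occurrence, eps=0.5) -> float:
    """Textbook Atkinson index for 0 <= eps < 1."""
    if not 0 <= eps < 1:
        raise ValueError("this sketch only covers 0 <= eps < 1")
    x = np.asarray(k_occurrence, dtype=float)
    # Generalized mean with exponent (1 - eps)
    generalized_mean = np.mean(x ** (1.0 - eps)) ** (1.0 / (1.0 - eps))
    return float(1.0 - generalized_mean / x.mean())

print(atkinson_sketch([4, 4, 4, 4, 4]))   # close to 0: perfectly even k-occurrence
print(atkinson_sketch([1, 1, 1, 1, 16]))  # approximately 0.36: one dominant hub
```

Larger eps values weight the low end of the distribution more strongly, so objects that are rarely anyone's neighbor influence the index more.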

estimate(self, X: numpy.ndarray, Y: numpy.ndarray = None, has_self_distances: bool = False)

Estimate hubness in a data set.

Hubness is estimated from the distances from all objects in X to all objects in Y. If Y is None, all-against-all distances between the objects in X are used. If self.metric == 'precomputed', X must be a distance matrix.

Parameters
X : ndarray, shape (n_query, n_features) or (n_query, n_indexed)

Array of query vectors, or distance matrix if self.metric == 'precomputed'.

Y : ndarray, shape (n_indexed, n_features) or None

Array of indexed vectors. If None, calculate distances between all pairs of objects in X.

has_self_distances : bool, default = False

Whether a precomputed distance matrix contains self-distances, which need to be excluded.

Returns
self : Hubness

An instance of class Hubness is returned. Hubness indices are provided as attributes (e.g. robinhood_index_).
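The quantity underlying all of these indices is the k-occurrence, derived from query-to-indexed distances as described above. The following NumPy sketch shows one way to obtain it for Euclidean distances, including the exclusion of self-distances in the all-against-all case. The function `k_occurrence_sketch` is hypothetical and only approximates what estimate is documented to do.

```python
import numpy as np

def k_occurrence_sketch(X, Y=None, k=5):
    """Return (k-neighbor indices, k-occurrence) for Euclidean distances."""
    X = np.asarray(X, dtype=float)
    all_against_all = Y is None
    Y = X if all_against_all else np.asarray(Y, dtype=float)
    # Pairwise Euclidean distances, shape (n_query, n_indexed)
    diff = X[:, None, :] - Y[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    if all_against_all:
        # An object must not count as its own nearest neighbor.
        np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    # Reverse neighbor count: how often each indexed object is selected.
    k_occ = np.bincount(neighbors.ravel(), minlength=Y.shape[0])
    return neighbors, k_occ

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
neighbors, k_occ = k_occurrence_sketch(X, k=5)
print(neighbors.shape)  # (100, 5)
print(k_occ.sum())      # 500: every query fills k neighbor slots
```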

static gini_index(k_occurrence: numpy.ndarray, limiting='memory') → float

Hubness measure; Gini index

Parameters
k_occurrence : ndarray

Reverse nearest neighbor count for each object.

limiting : 'memory' or 'cpu'

If 'cpu', use a fast implementation with high memory usage; if 'memory', use a slightly slower but memory-efficient implementation; otherwise, use a naive implementation (slow, low memory usage).
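The speed versus memory trade-off described above can be illustrated with two equivalent ways of evaluating the textbook Gini index (mean absolute difference divided by twice the mean). Both functions are sketches with hypothetical names; whether the library organizes its variants exactly like this is an assumption.

```python
import numpy as np

def gini_pairwise(k_occurrence) -> float:
    """All pairwise differences at once: fast, but O(n^2) memory."""
    x = np.asarray(k_occurrence, dtype=float)
    numerator = np.abs(x[:, None] - x[None, :]).sum()
    return float(numerator / (2.0 * x.size * x.sum()))

def gini_rowwise(k_occurrence) -> float:
    """One row of differences at a time: O(n) memory, Python-level loop."""
    x = np.asarray(k_occurrence, dtype=float)
    numerator = 0.0
    for value in x:
        numerator += np.abs(x - value).sum()
    return float(numerator / (2.0 * x.size * x.sum()))

k_occ = np.array([0, 0, 0, 20])
print(gini_pairwise(k_occ))  # 0.75
print(gini_rowwise(k_occ))   # 0.75
```

For large data sets the pairwise variant materializes an n-by-n array, which is the main reason to prefer a row-wise evaluation when memory is the bottleneck.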

static hub_occurrence(k: int, k_occurrence: numpy.ndarray, n_test: int, hub_size: float = 2)

Proportion of nearest neighbor slots occupied by hubs.

Parameters
k : int

Specifies the number of nearest neighbors

k_occurrence : ndarray

Reverse nearest neighbor count for each object.

n_test : int

Number of queries (or objects in a test set)

hub_size : float

Factor to determine hubs
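Using the hub definition given for the class (k-occurrence > hub_size * k), the proportion of neighbor slots taken by hubs can be sketched as follows. The function name `hub_occurrence_sketch` and its returned tuple are assumptions for illustration; the documentation does not state the method's return type.

```python
import numpy as np

def hub_occurrence_sketch(k, k_occurrence, n_test, hub_size=2.0):
    """Return (indices of hubs, share of neighbor slots occupied by hubs)."""
    k_occurrence = np.asarray(k_occurrence)
    hubs = np.flatnonzero(k_occurrence > hub_size * k)
    # n_test queries with k neighbors each give k * n_test slots in total.
    return hubs, k_occurrence[hubs].sum() / (k * n_test)

# 8 queries with k = 2 fill 16 neighbor slots.
k_occ = np.array([0, 1, 0, 5, 2, 0, 1, 7])
hubs, share = hub_occurrence_sketch(k=2, k_occurrence=k_occ, n_test=8)
print(hubs)   # [3 7]
print(share)  # 0.75
```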

static robinhood_index(k_occurrence: numpy.ndarray) → float

Hubness measure; Robin Hood/Hoover/Schutz index.

Parameters
k_occurrence : ndarray

Reverse nearest neighbor count for each object.

Notes

The Robin Hood index was proposed in [1] and is especially suited for hubness estimation in large data sets. Additionally, it offers straightforward interpretability by answering the question: What share of k-occurrence must be redistributed so that all objects are equally often nearest neighbors to others?

References

[1]

Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (2018).
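The redistribution interpretation given in the notes above corresponds to the standard Hoover index formula: half the sum of absolute deviations from the mean, divided by the total. A minimal sketch, with the hypothetical name `robinhood_sketch`:

```python
import numpy as np

def robinhood_sketch(k_occurrence) -> float:
    """Share of total k-occurrence that would need to be redistributed."""
    x = np.asarray(k_occurrence, dtype=float)
    return float(0.5 * np.abs(x - x.mean()).sum() / x.sum())

# One object holds 6 of 10 neighbor slots; an even split would give 2 each,
# so 4 of the 10 slots would have to move.
print(robinhood_sketch([1, 1, 1, 1, 6]))  # 0.4
print(robinhood_sketch([2, 2, 2, 2, 2]))  # 0.0
```

The computation needs only a single pass over the counts, which fits the note that the index suits large data sets.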

static skewness_truncnorm(k_occurrence: numpy.ndarray) → float

Hubness measure; corrected for non-negativity of k-occurrence.

Hubness as skewness of truncated normal distribution estimated from k-occurrence histogram.

Parameters
k_occurrence : ndarray

Reverse nearest neighbor count for each object.