Hubness
-
class hubness.analysis.Hubness(k: int = 10, hub_size: float = 2.0, metric='euclidean', k_neighbors: bool = False, k_occurrence: bool = False, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True, **kwargs)
Bases: object
Hubness characteristics of data set.
- Parameters
k (int) – Neighborhood size
hub_size (float) – Hubs are defined as objects with k-occurrence > hub_size * k.
metric (string, one of ['euclidean', 'cosine', 'precomputed']) – Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.
k_neighbors (bool) – Whether to save the k-neighbor lists. Requires O(n_test * k) memory.
k_occurrence (bool) – Whether to save the k-occurrence. Requires O(n_test) memory.
random_state (int, RandomState instance or None, optional) – CURRENTLY IGNORED. If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
shuffle_equal (bool, optional) – If True and metric='precomputed', shuffle neighbors with identical distances to avoid artifact hubness. NOTE: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.
n_jobs (int, optional) – CURRENTLY IGNORED. Number of processes for parallel computations. 1: Don't use multiprocessing. -1: Use all CPUs.
verbose (int, optional) – Level of output messages
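The tie-breaking idea behind shuffle_equal can be sketched with NumPy. This is a minimal illustration and not the library's implementation: the helper name knn_with_shuffled_ties and the toy distance matrix are assumptions made for the example. Candidates are put in random order first, and a stable sort then keeps that random order among equal distances.

```python
import numpy as np

def knn_with_shuffled_ties(D, k, rng):
    """Return k nearest neighbor indices per row of D, breaking distance ties at random.

    Hypothetical helper for illustration; not part of the hubness package.
    """
    n_query, n_indexed = D.shape
    neighbors = np.empty((n_query, k), dtype=int)
    for i in range(n_query):
        # Random candidate order, so that ties are not resolved by index position
        perm = rng.permutation(n_indexed)
        # A stable sort preserves the random order among identical distances
        order = np.argsort(D[i, perm], kind="stable")
        neighbors[i] = perm[order[:k]]
    return neighbors

rng = np.random.default_rng(0)
# Toy precomputed distances with many ties, as produced by measures
# that take only a few distinct values
D = np.array([[1.0, 1.0, 1.0, 2.0],
              [1.0, 1.0, 1.0, 2.0]])
nn = knn_with_shuffled_ties(D, k=2, rng=rng)
```

Without the shuffle, a plain argsort would tend to return the lowest indices for every query, which would make those objects look like hubs purely because of their position in the matrix.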
-
k_skewness_truncnom
Hubness, measured as the skewness of a truncated normal distribution fitted to the k-occurrence histogram
- Type
float
-
atkinson_index_
Hubness, measured as the Atkinson index of the k-occurrence distribution
- Type
float
-
gini_index_
Hubness, measured as the Gini index of the k-occurrence distribution
- Type
float
-
antihubs_
Indices of antihubs
- Type
int
-
antihub_occurrence_
Proportion of antihubs in the data set
- Type
float
-
hubs_
Indices of hubs
- Type
int
-
hub_occurrence_
Proportion of k-nearest neighbor slots occupied by hubs
- Type
float
-
groupie_ratio_
Proportion of objects with the largest hub in their neighborhood
- Type
float
-
k_occurrence_
Reverse neighbor count for each object
- Type
ndarray
-
k_neighbors_
Indices of the k-nearest neighbors for each object
- Type
ndarray
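How the neighbor lists, the k-occurrence, and the groupie ratio relate to each other can be sketched in plain NumPy. This is an illustration under stated assumptions (random toy data, brute-force Euclidean distances, a hypothetical helper k_occurrence_from_neighbors), not the code the class runs.

```python
import numpy as np

def k_occurrence_from_neighbors(k_neighbors, n_indexed):
    """Count how often each indexed object appears in any k-neighbor list."""
    return np.bincount(k_neighbors.ravel(), minlength=n_indexed)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))   # toy data: 50 objects, 20 features
k = 5

# Brute-force pairwise Euclidean distances; self-distances are excluded
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
np.fill_diagonal(D, np.inf)

# Indices of the k nearest neighbors of each object
k_neighbors = np.argsort(D, axis=1)[:, :k]

# Reverse neighbor count of each object
k_occurrence = k_occurrence_from_neighbors(k_neighbors, n_indexed=X.shape[0])

# Groupie ratio: share of objects whose neighbor list contains the largest hub
largest_hub = np.argmax(k_occurrence)
groupie_ratio = np.mean(np.any(k_neighbors == largest_hub, axis=1))
```

Every query fills exactly k neighbor slots, so the k-occurrence values always sum to k times the number of queries.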
References
- 1
Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531
- 2
Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge, 2018
Methods Summary
antihub_occurrence(k_occurrence) – Proportion of antihubs in data set.
atkinson_index(k_occurrence[, eps]) – Hubness measure; Atkinson index.
estimate(X[, Y, has_self_distances]) – Estimate hubness in a data set.
fit_transform(X[, Y, has_self_distances])
gini_index(k_occurrence[, limiting]) – Hubness measure; Gini index.
hub_occurrence(k, k_occurrence, n_test[, …]) – Proportion of nearest neighbor slots occupied by hubs.
robinhood_index(k_occurrence) – Hubness measure; Robin Hood/Hoover/Schutz index.
skewness_truncnorm(k_occurrence) – Hubness measure; corrected for non-negativity of k-occurrence.
Methods Documentation
-
static antihub_occurrence(k_occurrence: numpy.ndarray) → (numpy.ndarray, float)
Proportion of antihubs in data set.
Antihubs are objects that are never among the nearest neighbors of other objects.
- Parameters
k_occurrence (ndarray) – Reverse nearest neighbor count for each object.
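The definition above translates into a few lines of NumPy. The following is a sketch based only on that definition; the function name antihub_occurrence_sketch is hypothetical, and the return order (indices, then proportion) is an assumption taken from the signature.

```python
import numpy as np

def antihub_occurrence_sketch(k_occurrence):
    """Return antihub indices and their proportion among all objects."""
    k_occurrence = np.asarray(k_occurrence)
    # Antihubs never appear in any neighbor list, i.e. their k-occurrence is zero
    antihubs = np.flatnonzero(k_occurrence == 0)
    return antihubs, antihubs.size / k_occurrence.size

antihubs, proportion = antihub_occurrence_sketch(np.array([0, 3, 1, 0, 4, 2, 0, 2]))
# antihubs   -> array([0, 3, 6])
# proportion -> 0.375
```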
-
static atkinson_index(k_occurrence: numpy.ndarray, eps: float = 0.5) → float
Hubness measure; Atkinson index.
- Parameters
k_occurrence (ndarray) – Reverse nearest neighbor count for each object.
eps (float, default = 0.5) – ‘Income’ weight. Turns the index into a normative measure.
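For orientation, the textbook Atkinson index for eps ≠ 1 can be written as below. This sketch assumes the standard definition from inequality measurement; the library's implementation may differ in details, and the name atkinson_index_sketch is hypothetical.

```python
import numpy as np

def atkinson_index_sketch(k_occurrence, eps=0.5):
    """Textbook Atkinson index of a non-negative distribution, for eps != 1."""
    x = np.asarray(k_occurrence, dtype=float)
    if eps == 1:
        raise ValueError("This sketch covers eps != 1 only.")
    # Equally distributed equivalent: generalized mean with exponent (1 - eps)
    ede = np.mean(x ** (1 - eps)) ** (1 / (1 - eps))
    return 1.0 - ede / x.mean()

uniform = atkinson_index_sketch([3, 3, 3, 3])    # every object equally popular
skewed = atkinson_index_sketch([0, 0, 0, 12])    # one object takes all slots
```

A perfectly even k-occurrence yields 0, and the value grows toward 1 as neighbor slots concentrate on fewer objects.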
-
estimate(X: numpy.ndarray, Y: numpy.ndarray = None, has_self_distances: bool = False)
Estimate hubness in a data set.
Hubness is estimated from the distances from all objects in X to all objects in Y. If Y is None, all-against-all distances between the objects in X are used. If self.metric == 'precomputed', X must be a distance matrix.
- Parameters
X (ndarray, shape (n_query, n_features) or (n_query, n_indexed)) – Array of query vectors, or a distance matrix if self.metric == 'precomputed'.
Y (ndarray, shape (n_indexed, n_features) or None) – Array of indexed vectors. If None, calculate distances between all pairs of objects in X.
has_self_distances (bool, default = False) – Whether a precomputed distance matrix contains self-distances, which need to be excluded.
- Returns
self – An instance of class Hubness is returned. Hubness indices are provided as attributes (e.g. self.robinhood_index_).
- Return type
Hubness
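Why self-distances in a precomputed matrix need to be excluded can be seen in a small NumPy sketch. It assumes a square matrix whose diagonal holds the self-distances; the helper knn_from_precomputed is hypothetical and only mirrors the has_self_distances flag for illustration.

```python
import numpy as np

def knn_from_precomputed(D, k, has_self_distances=False):
    """k nearest neighbor indices per row of a precomputed distance matrix."""
    D = np.array(D, dtype=float)  # copy, so the caller's matrix is not modified
    if has_self_distances:
        # Mask the diagonal so that no object is its own nearest neighbor
        np.fill_diagonal(D, np.inf)
    return np.argsort(D, axis=1)[:, :k]

# Distances between three points on a line at positions 0, 1 and 3
D = np.array([[0.0, 1.0, 3.0],
              [1.0, 0.0, 2.0],
              [3.0, 2.0, 0.0]])

with_self = knn_from_precomputed(D, k=1)                               # [[0], [1], [2]]
without_self = knn_from_precomputed(D, k=1, has_self_distances=True)   # [[1], [0], [1]]
```

If the diagonal is left in place, every object is its own nearest neighbor, the k-occurrence becomes artificially uniform, and the hubness estimate is biased toward zero.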
-
fit_transform(X, Y=None, has_self_distances=False)
-
static gini_index(k_occurrence: numpy.ndarray, limiting='memory') → float
Hubness measure; Gini index.
- Parameters
k_occurrence (ndarray) – Reverse nearest neighbor count for each object.
limiting ('memory' or 'cpu') – If 'cpu', use a fast implementation with high memory usage; if 'memory', use a slightly slower but memory-efficient implementation; otherwise, use a naive implementation (slow, low memory usage).
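The trade-off behind the limiting parameter can be illustrated with two equivalent formulations of the textbook Gini index. Both functions are sketches with hypothetical names; the normalization used by the library may differ, so the values need not match its output exactly.

```python
import numpy as np

def gini_pairwise(k_occurrence):
    """Gini index via all pairwise differences: simple, but O(n^2) memory."""
    x = np.asarray(k_occurrence, dtype=float)
    n = x.size
    return np.abs(x[:, None] - x[None, :]).sum() / (2 * n * x.sum())

def gini_sorted(k_occurrence):
    """Gini index via sorted values: O(n) memory, O(n log n) time."""
    x = np.sort(np.asarray(k_occurrence, dtype=float))
    n = x.size
    ranks = np.arange(1, n + 1)
    return np.sum((2 * ranks - n - 1) * x) / (n * x.sum())

k_occ = np.array([0, 0, 0, 12])
g_pairwise = gini_pairwise(k_occ)
g_sorted = gini_sorted(k_occ)
```

Both variants compute the same quantity; the pairwise form materializes an n × n array, which becomes prohibitive for large data sets.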
-
static hub_occurrence(k: int, k_occurrence: numpy.ndarray, n_test: int, hub_size: float = 2)
Proportion of nearest neighbor slots occupied by hubs.
- Parameters
k (int) – Specifies the number of nearest neighbors
k_occurrence (ndarray) – Reverse nearest neighbor count for each object.
n_test (int) – Number of queries (or objects in a test set)
hub_size (float) – Factor to determine hubs
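Combining these parameters with the hub definition given for the class (k-occurrence > hub_size * k), the measure can be sketched as follows. The function name hub_occurrence_sketch is hypothetical, and the strict inequality is taken from the class description rather than from the source code.

```python
import numpy as np

def hub_occurrence_sketch(k, k_occurrence, n_test, hub_size=2.0):
    """Share of all k * n_test neighbor slots that are filled by hubs."""
    k_occurrence = np.asarray(k_occurrence)
    # Hubs appear in more than hub_size * k neighbor lists
    hubs = np.flatnonzero(k_occurrence > hub_size * k)
    return k_occurrence[hubs].sum() / (k * n_test)

# Six queries with k = 2 fill 12 neighbor slots in total
share = hub_occurrence_sketch(k=2, k_occurrence=[5, 0, 1, 6, 0, 0], n_test=6)
```

In this toy case the two objects with counts 5 and 6 exceed the threshold of 4, so hubs occupy 11 of the 12 slots.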
-
static robinhood_index(k_occurrence: numpy.ndarray) → float
Hubness measure; Robin Hood/Hoover/Schutz index.
- Parameters
k_occurrence (ndarray) – Reverse nearest neighbor count for each object.
Notes
The Robin Hood index was proposed in [1] and is especially suited for hubness estimation in large data sets. Additionally, it offers straightforward interpretability by answering the question: What share of k-occurrence must be redistributed so that all objects are equally often nearest neighbors to others?
References
- 1
Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge, 2018
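The redistribution interpretation from the notes can be checked with a short sketch of the standard Hoover (Robin Hood) index: half the total absolute deviation from the mean, relative to the total. The function name robinhood_index_sketch is hypothetical, and the library's implementation may be organized differently.

```python
import numpy as np

def robinhood_index_sketch(k_occurrence):
    """Share of total k-occurrence that would have to move to equalize all objects."""
    x = np.asarray(k_occurrence, dtype=float)
    # Each redistributed unit lowers one deviation and raises another, hence the 0.5
    return 0.5 * np.abs(x - x.mean()).sum() / x.sum()

skewed = robinhood_index_sketch([0, 0, 0, 12])
uniform = robinhood_index_sketch([3, 3, 3, 3])
```

In the skewed case, 9 of the 12 neighbor slots would have to be moved from the single hub to the three antihubs to give every object a k-occurrence of 3, hence an index of 0.75. The computation needs a single pass over the data, which fits the note on suitability for large data sets.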
-
static skewness_truncnorm(k_occurrence: numpy.ndarray) → float
Hubness measure; corrected for non-negativity of k-occurrence.
Hubness as the skewness of a truncated normal distribution estimated from the k-occurrence histogram.
- Parameters
k_occurrence (ndarray) – Reverse nearest neighbor count for each object.