API

class topo.TopOGraph(base_knn=10, graph_knn=10, n_eigs=100, basis='diffusion', graph='diff', base_metric='cosine', graph_metric='cosine', n_jobs=1, backend='nmslib', M=15, efC=50, efS=50, verbosity=1, cache_base=True, cache_graph=True, kernel_use='simple', alpha=1, plot_spectrum=False, eigen_expansion=False, delta=1.0, t='inf', p=0.6875, transitions=True, random_state=None)

Main TopOMetry class for learning topological similarities, bases, graphs, and layouts from high-dimensional data.

Learns topological similarity metrics from data, builds orthogonal bases from these metrics, and learns topological graphs from these bases. Users can choose different models to achieve these topological representations, combining diffusion harmonics, continuous k-nearest-neighbors or fuzzy simplicial sets to approximate the Laplace-Beltrami Operator. The topological graphs can then be visualized with multiple existing layout optimization tools.

Parameters
  • base_knn (int (optional, default 10)) – Number of k-nearest-neighbors to use when learning topological similarities. Consider this as a calculus discretization threshold (i.e. approaches zero in the limit of large data). For practical purposes, this is the minimum number of samples one would expect to constitute a neighborhood of its own. Increasing k can generate more globally comprehensive metrics and maps, to a certain extent, at the expense of fine-grained resolution. In practice, the default value of 10 performs well in most cases.

  • graph_knn (int (optional, default 10)) – Similar to base_knn, but used for learning topological graphs from the orthogonal bases.

  • n_eigs (int (optional, default 100)) – Number of components to compute. This number can be iterated to get different views from data at distinct spectral resolutions. If basis is set to diffusion, this is the number of computed diffusion components. If basis is set to continuous or fuzzy, this is the number of computed eigenvectors of the Laplacian Eigenmaps from the learned topological similarities.

  • basis ('diffusion', 'continuous' or 'fuzzy' (optional, default 'diffusion')) – Which topological basis model to learn from data. If diffusion, performs an optimized, anisotropic, adaptive diffusion mapping (default). If continuous, computes affinities from continuous k-nearest-neighbors, and a topological basis with Laplacian Eigenmaps. If fuzzy, computes affinities using fuzzy simplicial sets, and a topological basis with Laplacian Eigenmaps.

  • graph ('diff', 'cknn' or 'fuzzy' (optional, default 'diff')) – Which topological graph model to learn from the built basis. If ‘diff’, uses a second-order diffusion process to learn similarities and transition probabilities. If ‘cknn’, uses the continuous k-nearest-neighbors algorithm. If ‘fuzzy’, builds a fuzzy simplicial set graph from the active basis. All these algorithms learn graph-oriented topological metrics from the learned basis.

  • backend (str 'hnwslib', 'nmslib' or 'sklearn' (optional, default 'nmslib')) –

    Which backend to use to compute nearest-neighbors. Options for fast, approximate nearest-neighbors are ‘hnwslib’ and ‘nmslib’ (default). For exact nearest-neighbors, use ‘sklearn’.

    • If using ‘nmslib’, a sparse csr_matrix input is expected. If using ‘hnwslib’ or ‘sklearn’, a dense array is expected.

    • I strongly recommend ‘hnswlib’ when handling somewhat dense, array-shaped data. If the data is relatively sparse, you should use ‘nmslib’, which operates on sparse matrices by default in TopOMetry and will automatically convert the input array to csr_matrix for performance.

base_metric : str (optional, default ‘cosine’)

Distance metric for building an approximate kNN graph during topological basis construction. Defaults to ‘cosine’. Users are encouraged to explore different metrics, such as ‘euclidean’ and ‘inner_product’. The ‘hamming’ and ‘jaccard’ distances are also available for string vectors. Accepted metrics include NMSLib(*), HNSWlib(**) and sklearn(***) metrics. Some examples are:

- ‘sqeuclidean’ (**, ***)
- ‘euclidean’ (**, ***)
- ‘l1’ (*)
- ‘lp’ (*) - requires setting the parameter p; similar to Minkowski
- ‘cosine’ (**, ***)
- ‘inner_product’ (**)
- ‘angular’ (*)
- ‘negdotprod’ (*)
- ‘levenshtein’ (*)
- ‘hamming’ (*)
- ‘jaccard’ (*)
- ‘jansen-shan’ (*)

graph_metric : str (optional, default ‘cosine’)

Similar to base_metric, but used for building the topological graph.

p : int or float (optional, default 11/16)

P for the Lp metric, when metric=‘lp’. Can be fractional. The default 11/16 approximates 2/3, that is, an astroid norm, with some computational efficiency (2^n bases are less painstakingly slow to compute).
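As a rough illustration of what a fractional p means, the sketch below computes a generic L_p distance in plain Python. The function name lp_distance is hypothetical, and this is the textbook formula rather than the optimized NMSLib implementation; note that for p < 1 the result is not a true metric, since the triangle inequality can fail.

```python
def lp_distance(x, y, p=11 / 16):
    """Generic L_p distance: (sum_i |x_i - y_i|^p) ** (1 / p)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# p = 2 recovers the Euclidean distance, p = 1 the Manhattan distance.
euclidean = lp_distance([0.0, 0.0], [3.0, 4.0], p=2)
manhattan = lp_distance([0.0, 0.0], [3.0, 4.0], p=1)

# The default 11/16 = 0.6875 is close to 2/3 and has a power-of-two
# denominator (1/2 + 1/8 + 1/16).
fractional = lp_distance([0.0, 0.0], [3.0, 4.0])
```

With p below 1, coordinate-wise differences are combined more aggressively than in the Manhattan case, so the fractional distance here is larger than the p = 1 distance for the same pair of points.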

n_jobs : int (optional, default 1)

Number of threads to use in calculations. Set this as high as possible for speed.

M : int (optional, default 15)

A neighborhood search parameter. Defines the maximum number of neighbors in the zero and above-zero layers during HNSW (Hierarchical Navigable Small World Graph) search. However, the actual default maximum number of neighbors for the zero layer is 2*M. A reasonable range for this parameter is 5-100. For more information on HNSW, please check its manuscript (https://arxiv.org/abs/1603.09320). HNSW is implemented in Python via NMSlib (https://github.com/nmslib/nmslib) and HNSWlib (https://github.com/nmslib/hnswlib).

efC : int (optional, default 50)

A neighborhood search parameter. Increasing this value improves the quality of a constructed graph and leads to higher accuracy of search. However this also leads to longer indexing times. A reasonable range for this parameter is 50-2000.

efS : int (optional, default 50)

A neighborhood search parameter. Similarly to efC, increasing this value improves recall at the expense of longer retrieval time. A reasonable range for this parameter is 100-2000.

transitions : bool (optional, default True)

A diffusion harmonics parameter. Whether to use the transition probabilities rather than the diffusion potential when computing the diffusion harmonics model.

alpha : int or float (optional, default 1)

A diffusion harmonics parameter. Alpha in the diffusion maps literature. Controls how much the results are biased by the data distribution. Defaults to 1, which removes the bias from the underlying sampling distribution of the data.
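To make the role of alpha concrete, here is a minimal pure-Python sketch of the standard alpha-normalization from the diffusion maps literature, followed by row normalization into transition probabilities. The helper names and the toy kernel are hypothetical, and TopOMetry's internal implementation may differ in its details.

```python
def alpha_normalize(K, alpha=1.0):
    """Divide each kernel entry K[i][j] by (d_i * d_j) ** alpha, with d the row sums."""
    d = [sum(row) for row in K]
    n = len(K)
    return [[K[i][j] / ((d[i] * d[j]) ** alpha) for j in range(n)] for i in range(n)]


def row_normalize(K):
    """Turn a non-negative kernel into a row-stochastic transition matrix."""
    return [[v / sum(row) for v in row] for row in K]


# Toy symmetric kernel between three samples.
K = [[1.0, 0.5, 0.1],
     [0.5, 1.0, 0.5],
     [0.1, 0.5, 1.0]]

K_alpha = alpha_normalize(K, alpha=1.0)  # density-corrected kernel
T = row_normalize(K_alpha)               # transition probabilities
```

With alpha=0 the kernel is left untouched, so the sampling density fully influences the result; alpha=1 divides it out before the transition matrix is built.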

kernel_use : str (optional, default ‘simple’)

A diffusion harmonics parameter. Which type of kernel to use in the diffusion harmonics model. Four kernels are implemented, differing in whether they use the adaptive decay and the neighborhood expansion: ‘simple’, ‘decay’, ‘simple_adaptive’ and ‘decay_adaptive’.

  • ‘simple’ is a locally-adaptive kernel similar to that proposed by Nadler et al. (https://doi.org/10.1016/j.acha.2005.07.004) and implemented in Setty et al. (https://doi.org/10.1038/s41587-019-0068-4). It is the fastest option.

  • ‘decay’ applies an adaptive decay rate, but no neighborhood expansion.

  • The options ending in ‘_adaptive’ apply the neighborhood expansion process.

The neighborhood expansion can impact runtime, although the impact is not usually significant for datasets under 10e6 samples. If you’re not obtaining good separation between expected clusters, consider changing this to ‘decay_adaptive’ with a small number of neighbors.

delta : float (optional, default 1.0)

A CkNN parameter that sets the radius for each point. The combination radius increases in proportion to this parameter.
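The effect of delta can be sketched with the usual continuous k-nearest-neighbors rule, in which two points are connected when their distance is below delta times the geometric mean of their k-th neighbor distances. The function below is a hypothetical, brute-force illustration of that rule and is not TopOMetry's implementation.

```python
import math


def cknn_edges(points, k=1, delta=1.0):
    """Return the set of edges (i, j), i < j, selected by the CkNN criterion."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Distance to the k-th nearest neighbor (sorted index 0 is the point itself).
    d_k = [sorted(row)[k] for row in dist]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < delta * math.sqrt(d_k[i] * d_k[j]):
                edges.add((i, j))
    return edges


# Three close points and one distant point on a line.
points = [(0.0,), (1.0,), (2.0,), (10.0,)]
tight = cknn_edges(points, k=1, delta=1.5)
loose = cknn_edges(points, k=1, delta=3.0)
```

Raising delta only ever adds edges: the tighter setting links immediate neighbors, while the looser one also reaches the isolated point.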

t : ‘inf’, float or int (optional, default ‘inf’)

A CkNN parameter encoding the decay of the heat kernel. The weights are calculated as: W_{ij} = exp(-(||x_{i}-x_{j}||^2)/t)
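The weight formula above can be written out directly. In this sketch the function name is hypothetical, and treating t=‘inf’ as the limit in which every weight equals 1 (an unweighted graph) is an assumption about how the option is interpreted.

```python
import math


def heat_kernel_weight(x_i, x_j, t="inf"):
    """W_ij = exp(-||x_i - x_j||^2 / t); t='inf' is taken as the limit W_ij = 1."""
    if t == "inf":
        return 1.0
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / float(t))


w_finite = heat_kernel_weight([0.0, 0.0], [1.0, 1.0], t=2)  # squared distance 2, so exp(-1)
w_inf = heat_kernel_weight([0.0, 0.0], [1.0, 1.0])          # no decay
```

Smaller values of t make the weights decay faster with distance, so distant neighbors contribute less to the graph.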

verbosity : int (optional, default 1)

Controls verbosity. 0 for no verbosity, 1 for minimal (prints warnings and runtimes of major steps), 2 for medium (also prints layout optimization messages) and 3 for full (down to neighborhood search, useful for debugging).

cache_base : bool (optional, default True)

Whether to cache intermediate matrices used in computing orthogonal bases (k-nearest-neighbors, diffusion harmonics etc).

cache_graph : bool (optional, default True)

Whether to cache intermediate matrices used in computing topological graphs (k-nearest-neighbors, diffusion harmonics etc).

plot_spectrum : bool (optional, default False)

Whether to plot the informational decay spectrum obtained during eigendecomposition of similarity matrices.

eigen_expansion : bool (optional, default False)

Whether to try to find a discrete eigengap during eigendecomposition. This can severely impact runtime, as it can take numerous eigendecompositions to do so.
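A minimal usage sketch of the constructor, restating a few of the defaults from the class signature. The dictionary name DEFAULT_KWARGS and the helper build_topograph are hypothetical; topo is imported lazily so that the snippet can be loaded without TopOMetry installed.

```python
# A subset of the keyword arguments documented above, with their defaults.
DEFAULT_KWARGS = {
    "base_knn": 10,
    "graph_knn": 10,
    "n_eigs": 100,
    "basis": "diffusion",
    "graph": "diff",
    "base_metric": "cosine",
    "graph_metric": "cosine",
    "backend": "nmslib",
    "n_jobs": 1,
    "verbosity": 1,
}


def build_topograph(**overrides):
    """Create a topo.TopOGraph from the defaults above plus user overrides."""
    import topo  # lazy import: requires TopOMetry to be installed

    params = {**DEFAULT_KWARGS, **overrides}
    return topo.TopOGraph(**params)


# Example (needs TopOMetry): a continuous basis with a fuzzy graph on 4 threads.
# tg = build_topograph(basis="continuous", graph="fuzzy", n_jobs=4)
```

Keeping the defaults in one dictionary makes it easy to see which settings a given run changed.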

MAP(data=None, graph=None, n_components=2, min_dist=0.3, spread=1, initial_alpha=1, n_epochs=400, metric=None, metric_kwds={}, output_metric='euclidean', output_metric_kwds={}, gamma=1.2, negative_sample_rate=10, init='spectral', random_state=None, euclidean_output=True, parallel=True, densmap=False, densmap_kwds={}, output_dens=False, return_aux=False)


Manifold Approximation and Projection, as proposed by Leland McInnes with a uniform distribution assumption in the seminal [UMAP algorithm](https://umap-learn.readthedocs.io/en/latest/index.html). Performs a fuzzy simplicial set embedding, using a specified initialisation method and then minimizing the fuzzy set cross entropy between the 1-skeletons of the high and low dimensional fuzzy simplicial sets. The fuzzy simplicial set embedding was proposed and implemented by Leland McInnes in UMAP (see umap-learn <https://github.com/lmcinnes/umap>). Here it is used only for the projection (layout optimization), minimizing the cross-entropy between a phenotypic map (i.e. data, TopOMetry non-uniform latent mappings) and its graph topological representation.

Parameters
  • data (array of shape (n_samples, n_features)) – The source data to be embedded by UMAP. If None (default), the active basis will be used.

  • graph (scipy.sparse.csr_matrix (n_samples, n_samples)) – The 1-skeleton of the high dimensional fuzzy simplicial set as represented by a graph for which we require a sparse matrix for the (weighted) adjacency matrix. If None (default), a fuzzy simplicial set is computed with default parameters.

  • n_components (int (optional, default 2)) – The dimensionality of the euclidean space into which to embed the data.

  • initial_alpha (float (optional, default 1)) – Initial learning rate for the SGD.

  • gamma (float (optional, default 1.2)) – Weight to apply to negative samples.

  • negative_sample_rate (int (optional, default 10)) – The number of negative samples to select per positive sample in the optimization process. Increasing this value will result in greater repulsive force being applied and greater optimization cost, but slightly more accuracy.

  • n_epochs (int (optional, default 400)) – The number of training epochs to be used in optimizing the low dimensional embedding. Larger values result in more accurate embeddings. If 0 is specified, a value will be selected based on the size of the input dataset (200 for large datasets, 500 for small).

  • init (string (optional, default 'spectral')) – How to initialize the low dimensional embedding. Options are:

    • ‘spectral’: use a spectral embedding of the fuzzy 1-skeleton.

    • ‘random’: assign initial embedding positions at random.

    • A numpy array of initial embedding positions.

  • random_state (numpy RandomState or equivalent.) – A state capable of being used as a numpy random state.

  • metric (string or callable.) – The metric used to measure distance in high dimensional space; used if multiple connected components need to be laid out. Defaults to TopOGraph.graph_metric.

  • metric_kwds (dict (optional, no default)) – Keyword arguments to be passed to the metric function; used if multiple connected components need to be laid out.

  • densmap (bool (optional, default False)) – Whether to use the density-augmented objective function to optimize the embedding according to the densMAP algorithm.

  • densmap_kwds (dict (optional, no default)) – Key word arguments to be used by the densMAP optimization.

  • output_dens (bool (optional, default False)) – Whether to output local radii in the original data and the embedding.

  • output_metric (function (optional, no default)) – Function returning the distance between two points in embedding space and the gradient of the distance with respect to the first argument.

  • output_metric_kwds (dict (optional, no default)) – Key word arguments to be passed to the output_metric function.

  • euclidean_output (bool (optional, default True)) – Whether to use the faster code specialised for euclidean output metrics

  • parallel (bool (optional, default True)) – Whether to run the computation using numba parallel. Running in parallel is non-deterministic, and is not used if a random seed has been set, to ensure reproducibility.

  • return_aux (bool , (optional, default False)) – Whether to also return the auxiliary data, i.e. initialization and local radii.

Returns

  • embedding (array of shape (n_samples, n_components)) – The optimized embedding of graph into an n_components-dimensional Euclidean space.

  • aux_data (dict) – Returned only if return_aux is set to True. Auxiliary dictionary output returned with the embedding. aux_data['Y_init'] is an array of shape (n_samples, n_components) holding the spectral initialization of graph into an n_components-dimensional Euclidean space. When the densMAP extension is turned on, this dictionary also includes local radii in the original data (aux_data['rad_orig']) and in the embedding (aux_data['rad_emb']).

MDE(basis=None, graph=None, n_components=2, n_neighbors=None, type='isomorphic', n_epochs=500, snapshot_every=30, constraint=None, init='quadratic', repulsive_fraction=None, max_distance=None, device='cpu', eps=0.001, mem_size=1)

This function constructs a Minimum Distortion Embedding (MDE) problem for preserving the structure of original data. This MDE problem is well-suited for visualization (using dim 2 or 3), but can also be used to generate features for machine learning tasks (with dim = 10, 50, or 100, for example). It yields embeddings in which similar items are near each other, and dissimilar items are not near each other. The original data can either be a data matrix, or a graph. Data matrices should be torch Tensors, NumPy arrays, or scipy sparse matrices; graphs should be instances of pymde.Graph. The MDE problem uses distortion functions derived from weights (i.e., penalties). To obtain an embedding, call the embed method on the returned MDE object. To plot it, use pymde.plot.

Parameters
  • basis (str ('diffusion', 'continuous' or 'fuzzy')) – Which basis to use when computing the embedding. Defaults to the active basis.

  • graph (scipy.sparse matrix.) – The affinity matrix to embed with. Defaults to the active graph. If init = ‘spectral’, a fuzzy simplicial set is used, and this argument is ignored.

  • n_components (int (optional, default 2)) – The embedding dimension. Use 2 or 3 for visualization.

  • constraint (str (optional, default 'standardized')) – Constraint to use when optimizing the embedding. Options are ‘standardized’, ‘centered’, None or a pymde.constraints.Constraint() function.

  • n_neighbors (int (optional)) – The number of nearest neighbors to compute for each row (item) of data. A sensible value is chosen by default, depending on the number of items.

  • repulsive_fraction (float (optional)) – How many repulsive edges to include, relative to the number of attractive edges. 1 means as many repulsive edges as attractive edges. The higher this number, the more uniformly spread out the embedding will be. Defaults to 0.5 for standardized embeddings, and 1 otherwise. (If repulsive_penalty is None, this argument is ignored.)

  • max_distance (float (optional)) – If not None, neighborhoods are restricted to have a radius no greater than max_distance.

  • init (str or np.ndarray (optional, default 'quadratic')) – Initialization strategy; np.ndarray, ‘quadratic’ or ‘random’.

  • device (str (optional)) – Device for the embedding (e.g. ‘cpu’, ‘cuda’).

Returns

A pymde.MDE object, based on the original data.

Return type

torch.tensor

PaCMAP(data=None, init='spectral', n_components=2, n_neighbors=None, MN_ratio=0.5, FP_ratio=2.0, pair_neighbors=None, pair_MN=None, pair_FP=None, distance='angular', lr=1.0, num_iters=450, intermediate=False)

Performs Pairwise-Controlled Manifold Approximation and Projection.

Parameters
  • data (np.ndarray or scipy.sparse.csr_matrix (optional, default None)) – Data to be embedded. If None, will use the active orthogonal basis.

  • init (np.ndarray of shape (N,2) or str (optional, default 'spectral')) – Initialization positions. Defaults to a multicomponent spectral embedding (‘spectral’). Other options are ‘pca’ or ‘random’.

  • n_components (int (optional, default 2)) – How many components to embed into.

  • n_neighbors (int (optional, default None)) – How many neighbors to use during embedding. If None, will use TopOGraph.graph_knn.

  • MN_ratio (float (optional, default 0.5)) – The ratio of the number of mid-near pairs to the number of neighbors, n_MN = n_neighbors * MN_ratio.

  • FP_ratio (float (optional, default 2.0)) – The ratio of the number of further pairs to the number of neighbors, n_FP = n_neighbors * FP_ratio.

  • distance (str (optional, default 'angular')) – Distance metric to use. Options are ‘euclidean’, ‘angular’, ‘manhattan’ and ‘hamming’.

  • lr (float (optional, default 1.0)) – Learning rate of the AdaGrad optimizer.

  • num_iters (int (optional, default 450)) – Number of iterations. The default of 450 is enough for most datasets to converge.

  • intermediate (bool (optional, default False)) – Whether PaCMAP should also output the intermediate stages of the optimization process of the lower dimension embedding. If True, then the output will be a numpy array of the size (n, n_dims, 13), where each slice is a “screenshot” of the output embedding at a particular number of steps, from [0, 10, 30, 60, 100, 120, 140, 170, 200, 250, 300, 350, 450].

  • pair_neighbors (optional, default None.) – Pre-specified neighbor pairs. Allows users to supply their own graphs.

  • pair_MN (optional, default None.) – Pre-specified mid-near pairs. Allows users to supply their own graphs.

  • pair_FP (optional, default None.) – Pre-specified further pairs. Allows users to supply their own graphs.
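The two ratio parameters above translate into pair counts as follows. The helper name is hypothetical and the rounding to the nearest integer is an assumption; only the products n_neighbors * MN_ratio and n_neighbors * FP_ratio come from the parameter descriptions.

```python
def pacmap_pair_counts(n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0):
    """Number of mid-near and further pairs sampled per point."""
    n_MN = int(round(n_neighbors * MN_ratio))
    n_FP = int(round(n_neighbors * FP_ratio))
    return n_MN, n_FP


# With 10 neighbors and the default ratios: 5 mid-near and 20 further pairs.
n_MN, n_FP = pacmap_pair_counts()
```

Since n_neighbors falls back to TopOGraph.graph_knn when left as None, changing graph_knn also changes how many mid-near and further pairs are used.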

Returns

PaCMAP embedding.

TriMAP(basis=None, graph=None, init=None, n_components=2, n_inliers=10, n_outliers=5, n_random=5, use_dist_matrix=False, metric='euclidean', lr=1000.0, n_iters=400, triplets=None, weights=None, knn_tuple=None, weight_adj=500.0, opt_method='dbd', return_seq=False)

Graph layout optimization using triplets.

Parameters
  • basis (str (optional, default None)) –

  • graph (str (optional, default None)) –

  • init (str (optional, default None)) –

  • n_components (int (optional, default 2)) – Number of dimensions of the embedding.

  • n_inliers (int (optional, default 10)) – Number of inlier points for triplet constraints.

  • n_outliers (int (optional, default 5)) – Number of outlier points for triplet constraints.

  • n_random (int (optional, default 5)) – Number of random triplet constraints per point.

  • metric (str (optional, default 'euclidean')) – Distance measure (‘euclidean’, ‘manhattan’, ‘angular’, ‘hamming’)

  • use_dist_matrix (bool (optional, default False)) – Use TopOMetry’s learned similarities between samples. As of now, this is unstable.

  • lr (int (optional, default 1000)) – Learning rate.

  • n_iters (int (optional, default 400)) – Number of iterations.

  • opt_method (str (optional, default 'dbd')) – Optimization method (‘sd’: steepest descent, ‘momentum’: GD with momentum, ‘dbd’: GD with momentum delta-bar-delta).

  • return_seq (bool (optional, default False)) – Return the sequence of maps recorded every 10 iterations.

Returns

TriMAP embedding.

fit(X)

Learn topological distances with diffusion harmonics and continuous metrics. Computes affinity operators that approximate the Laplace-Beltrami operator.

Parameters

X – High-dimensional data matrix. Currently, only data of a single type is supported (e.g. all bool or all float).

Returns

  • TopOGraph instance with several slots, populated as per user settings.

  • If basis=‘diffusion’, populates TopOGraph.MSDiffMap with a multiscale diffusion mapping of data, and TopOGraph.DiffBasis with a fitted topo.tpgraph.diff.Diffusor() class containing diffusion metrics and transition probabilities, respectively stored in TopOGraph.DiffBasis.K and TopOGraph.DiffBasis.T.

  • If basis=‘continuous’, populates TopOGraph.CLapMap with a continuous Laplacian Eigenmapping of data, and TopOGraph.ContBasis with a continuous-k-nearest-neighbors model, containing continuous metrics and adjacency, respectively stored in TopOGraph.ContBasis.K and TopOGraph.ContBasis.A.

  • If basis=‘fuzzy’, populates TopOGraph.FuzzyLapMap with a fuzzy Laplacian Eigenmapping of data, and TopOGraph.FuzzyBasis with a fuzzy simplicial set model, containing continuous metrics.

run_layouts(X, n_components=2, bases=['diffusion', 'fuzzy', 'continuous'], graphs=['diff', 'cknn', 'fuzzy'], layouts=['tSNE', 'MAP', 'MDE', 'PaCMAP', 'TriMAP', 'NCVis'])

Master function to easily run all combinations of possible bases and graphs that approximate the Laplace-Beltrami Operator, and the 6 layout options within TopOMetry: tSNE, MAP, MDE, PaCMAP, TriMAP, and NCVis.

Parameters
  • X (np.ndarray or scipy.sparse.csr_matrix) – Data matrix.

  • n_components (int (optional, default 2)) – Number of components for visualization.

  • bases (list of str (optional, default ['diffusion', 'fuzzy', 'continuous'])) – Which bases to compute. Defaults to all. To run only one or two bases, set it to [‘fuzzy’, ‘diffusion’] or [‘continuous’], for example.

  • graphs (list of str (optional, default ['diff', 'cknn', 'fuzzy'])) – Which graphs to compute. Defaults to all. To run only one or two graphs, set it to [‘fuzzy’, ‘diff’] or [‘cknn’], for example.

  • layouts (list of str (optional, default ['tSNE', 'MAP', 'MDE', 'PaCMAP', 'TriMAP', 'NCVis'])) – Which layouts to compute. Defaults to all 6 options within TopOMetry: tSNE, MAP, MDE, PaCMAP, TriMAP and NCVis. To run only one or two layouts, set it to [‘tSNE’, ‘MAP’] or [‘PaCMAP’], for example.

Returns

Populates the TopOMetry object slots.
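To get a feel for how much work the default arguments imply, the sketch below enumerates the combinations, assuming every basis is paired with every graph and every layout. The variable names are illustrative only and nothing here calls TopOMetry.

```python
from itertools import product

bases = ["diffusion", "fuzzy", "continuous"]
graphs = ["diff", "cknn", "fuzzy"]
layouts = ["tSNE", "MAP", "MDE", "PaCMAP", "TriMAP", "NCVis"]

# Every (basis, graph, layout) triple under the all-combinations assumption.
all_runs = list(product(bases, graphs, layouts))

# Restricting the lists shrinks the workload accordingly.
small_runs = list(product(["diffusion"], ["diff", "fuzzy"], ["MAP", "PaCMAP"]))
```

Under that assumption the defaults amount to 3 x 3 x 6 = 54 layouts, so restricting bases, graphs or layouts is worthwhile on large datasets.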

scree_plot(basis=None, use_eigs='knee', curve=None, verbose=False)

Visualize the scree plot of information entropy.

Parameters
  • basis (str (optional, default None)) – If None, will use the default basis. Otherwise, uses the specified basis (must be ‘diffusion’, ‘continuous’ or ‘fuzzy’).

  • use_eigs (int or str (optional, default 'knee')) – Number of eigenvectors to use. If ‘max’, expands to the maximum number of positive eigenvalues (reach of numerical precision), else to the maximum amount of computed components. If ‘knee’, uses Kneedle to find an optimal cutoff point, and expands it by expansion. If ‘comp_gap’, tries to find a discrete eigengap from the computation process.

  • verbose (bool (optional, default False)) – Controls verbosity

Returns

A nice plot.

spectral_layout(graph=None, n_components=2, cache=True)

Performs a multicomponent spectral layout of the data and the target similarity matrix.

Parameters
  • graph (scipy.sparse.csr.csr_matrix.) – affinity matrix (i.e. topological graph). If None (default), uses the default graph from the default basis.

  • n_components (int (optional, default 2)) – number of dimensions to embed into.

  • cache (bool (optional, default True)) – Whether to cache the embedding to the TopOGraph object.

Returns

np.ndarray containing the resulting embedding.

tSNE(data=None, graph=None, n_components=2, early_exaggeration=12, n_iter=1000, n_iter_early_exag=250, n_iter_without_progress=30, min_grad_norm=1e-07, init='random', random_state=None, angle=0.5, cheat_metric=True)

The classic t-SNE embedding, usually on top of a TopOMetry topological basis.

Parameters
  • data (optional, default None) –

  • graph (optional, default None) –

  • n_components (int (optional, default 2)) –

  • early_exaggeration (optional, default 12) – Sets the exaggeration.

  • n_iter (optional, default 1000) – Number of iterations to optimize.

  • n_iter_early_exag (optional, default 250) – Number of iterations in early exaggeration.

  • init (np.ndarray (optional, defaults to tg.SpecLayout)) – Initialisation for the optimization problem.

  • random_state (optional, default None) –

transform(basis=None)

Learns new affinities and topological operators from the chosen basis.

Parameters
  • self – TopOGraph instance.

  • basis (str, optional.) – Base to use when building the topological graph. Defaults to the active base ( TopOGraph.basis). Setting this updates the active base.

Returns

scipy.sparse.csr.csr_matrix, containing the similarity matrix that encodes the topological graph.
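Putting fit and transform together, a typical call sequence might look like the sketch below. The helper learn_graph is hypothetical; it only assumes an object exposing the fit(X) and transform(basis=None) methods documented above, such as a TopOGraph instance.

```python
def learn_graph(tg, X, basis=None):
    """Fit a TopOGraph-like object on X, then build its topological graph."""
    tg.fit(X)                         # learn similarities and the orthogonal basis
    return tg.transform(basis=basis)  # similarity matrix encoding the graph


# Example (needs TopOMetry and a data matrix X):
# import topo
# tg = topo.TopOGraph()
# graph = learn_graph(tg, X)
# embedding = tg.MAP(graph=graph)
```

Leaving basis as None keeps the active basis, while passing a name switches the active basis before the graph is built.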