This document demonstrates the use of the riemannian_stats package to perform Riemannian Principal Component Analysis (R-PCA) on a high-dimensional synthetic dataset (Data10D_250.csv). R-PCA is a novel extension of standard PCA that leverages the local geometry of the data through a Riemannian manifold structure induced via UMAP.
We showcase the full pipeline: loading and preprocessing the dataset, computing manifold-based metrics, extracting Riemannian principal components, and visualizing both the structure and correlation of the data.
from riemannian_stats import riemannian_analysis, visualization, data_processing, utilities
data = data_processing.load_data("./data/Data10D_250.csv", separator=",", decimal=".")
n_neighbors = int(len(data) / 5)
if 'cluster' in data.columns:
    clusters = data['cluster']
    data_with_clusters = data.copy()
    data = data.iloc[:, :-1]
else:
    clusters = None
    data_with_clusters = data
We load the high-dimensional dataset from a comma-separated file. If a cluster column is present, it is used to extract the cluster labels and is then removed before the statistical analysis.
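The snippet above removes the label column by position (data.iloc[:, :-1]), which only works when cluster is the last column. A minimal sketch of the same split done by column name is shown below, assuming the loaded data is a pandas DataFrame; the helper name split_labels and the toy frame are hypothetical and not part of riemannian_stats.

```python
import pandas as pd

def split_labels(df, label_col="cluster"):
    """Return (features, labels); labels is None when label_col is absent."""
    if label_col in df.columns:
        # Drop by name so the position of the label column does not matter.
        return df.drop(columns=label_col), df[label_col]
    return df, None

# Tiny made-up frame, used only to exercise the helper.
toy = pd.DataFrame({"cluster": [0, 1, 0],
                    "f1": [1.0, 2.0, 3.0],
                    "f2": [4.0, 5.0, 6.0]})
features, labels = split_labels(toy)
print(list(features.columns))  # ['f1', 'f2']
```

Dropping by name keeps the feature matrix intact even if the label column is stored first or in the middle of the file.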
analysis = riemannian_analysis(data, n_neighbors=n_neighbors)
We initialize the Riemannian analysis instance, using k = len(data)/5 (truncated to an integer) as the number of neighbors, a choice based on the expected number of clusters.
umap_similarities = analysis.umap_similarities
print("UMAP Similarities Matrix:\n", umap_similarities)
## UMAP Similarities Matrix:
## [[0. 0. 0.10272729 ... 0. 0.04154779 0. ]
## [0. 0. 0. ... 0. 0. 0. ]
## [0.10272729 0. 0. ... 0. 0. 0. ]
## ...
## [0. 0. 0. ... 0. 0.11827033 0. ]
## [0.04154779 0. 0. ... 0.11827033 0. 0. ]
## [0. 0. 0. ... 0. 0. 0. ]]
rho = analysis.rho
print("Rho Matrix:\n", rho)
## Rho Matrix:
## [[1. 1. 0.8972727 ... 1. 0.9584522 1. ]
## [1. 1. 1. ... 1. 1. 1. ]
## [0.8972727 1. 1. ... 1. 1. 1. ]
## ...
## [1. 1. 1. ... 1. 0.88172966 1. ]
## [0.9584522 1. 1. ... 0.88172966 1. 1. ]
## [1. 1. 1. ... 1. 1. 1. ]]
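UMAP similarities define the local neighborhood structure: a nonzero entry marks a pair of points that UMAP treats as neighbors. In the printed output, each entry of the rho matrix equals 1 minus the corresponding similarity (for example, 1 - 0.10272729 ≈ 0.8972727), so rho is 1 for non-neighbors and smaller for close neighbors. This pairwise scaling forms the foundation of the Riemannian metric.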
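The relationship between the two printed matrices can be checked by hand. The sketch below assumes, based only on the values printed above and not on the package source, that rho is obtained elementwise as 1 minus the UMAP similarity; the 3x3 matrix is the top-left corner of the printed similarity matrix.

```python
import numpy as np

# Top-left 3x3 corner of the printed UMAP similarity matrix.
sim = np.array([[0.0,        0.0, 0.10272729],
                [0.0,        0.0, 0.0],
                [0.10272729, 0.0, 0.0]])

# Assumption inferred from the printed output: rho = 1 - similarity.
rho = 1.0 - sim
print(rho)
```

Under this assumption, pairs that UMAP does not connect keep a weight of 1, while connected pairs are pulled closer together by a weight below 1.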
riemannian_diff = analysis.riemannian_diff
print("Riemannian Vector Differences:\n", riemannian_diff)
## Riemannian Vector Differences:
## [[[ 0. 0. 0. ... 0. 0.
## 0. ]
## [ 0.49167881 -9.35504083 0.73083244 ... -1.69081857 -0.53916487
## -2.42982678]
## [ -2.45635885 -3.06323063 -2.23033423 ... -0.70641431 -0.58870373
## -1.06245603]
## ...
## [ 2.12384659 5.18906417 1.58827847 ... -0.18652492 -2.55233115
## -1.8777029 ]
## [ 2.13164278 4.69335938 -0.11454679 ... -0.58611284 -0.23922768
## -0.5809582 ]
## [ 4.34901569 7.87640137 -1.18914482 ... -2.04384362 -1.12418008
## -1.99329798]]
##
## [[ -0.49167881 9.35504083 -0.73083244 ... 1.69081857 0.53916487
## 2.42982678]
## [ 0. 0. 0. ... 0. 0.
## 0. ]
## [ -3.22926219 5.9411059 -3.21651401 ... 0.90352803 -0.11693859
## 1.24573189]
## ...
## [ 1.63216777 14.54410499 0.85744603 ... 1.50429365 -2.01316628
## 0.55212388]
## [ 1.73236817 14.25185181 -0.8503447 ... 1.07929843 0.28956696
## 1.82368473]
## [ 3.85733687 17.2314422 -1.91997726 ... -0.35302505 -0.58501521
## 0.43652879]]
##
## [[ 2.45635885 3.06323063 2.23033423 ... 0.70641431 0.58870373
## 1.06245603]
## [ 3.22926219 -5.9411059 3.21651401 ... -0.90352803 0.11693859
## -1.24573189]
## [ 0. 0. 0. ... 0. 0.
## 0. ]
## ...
## [ 4.86142997 8.6029991 4.07396005 ... 0.60076562 -1.89622769
## -0.69360801]
## [ 4.96163037 8.31074591 2.36616931 ... 0.1757704 0.40650555
## 0.57795284]
## [ 7.08659907 11.29033631 1.29653675 ... -1.25655308 -0.46807661
## -0.80920309]]
##
## ...
##
## [[ -2.12384659 -5.18906417 -1.58827847 ... 0.18652492 2.55233115
## 1.8777029 ]
## [ -1.63216777 -14.54410499 -0.85744603 ... -1.50429365 2.01316628
## -0.55212388]
## [ -4.86142997 -8.6029991 -4.07396005 ... -0.60076562 1.89622769
## 0.69360801]
## ...
## [ 0. 0. 0. ... 0. 0.
## 0. ]
## [ 0.08834967 -0.2576883 -1.50580975 ... -0.37473089 2.0303882
## 1.12117292]
## [ 2.2251691 2.68733721 -2.7774233 ... -1.8573187 1.42815107
## -0.11559509]]
##
## [[ -2.13164278 -4.69335938 0.11454679 ... 0.58611284 0.23922768
## 0.5809582 ]
## [ -1.73236817 -14.25185181 0.8503447 ... -1.07929843 -0.28956696
## -1.82368473]
## [ -4.96163037 -8.31074591 -2.36616931 ... -0.1757704 -0.40650555
## -0.57795284]
## ...
## [ -0.08834967 0.2576883 1.50580975 ... 0.37473089 -2.0303882
## -1.12117292]
## [ 0. 0. 0. ... 0. 0.
## 0. ]
## [ 2.1249687 2.9795904 -1.06963256 ... -1.43232348 -0.87458216
## -1.38715593]]
##
## [[ -4.34901569 -7.87640137 1.18914482 ... 2.04384362 1.12418008
## 1.99329798]
## [ -3.85733687 -17.2314422 1.91997726 ... 0.35302505 0.58501521
## -0.43652879]
## [ -7.08659907 -11.29033631 -1.29653675 ... 1.25655308 0.46807661
## 0.80920309]
## ...
## [ -2.2251691 -2.68733721 2.7774233 ... 1.8573187 -1.42815107
## 0.11559509]
## [ -2.1249687 -2.9795904 1.06963256 ... 1.43232348 0.87458216
## 1.38715593]
## [ 0. 0. 0. ... 0. 0.
## 0. ]]]
umap_distance_matrix = analysis.umap_distance_matrix
print("UMAP Distance Matrix:\n", umap_distance_matrix)
## UMAP Distance Matrix:
## [[ 0. 10.00338788 5.0584905 ... 7.00964061 6.148487
## 9.84428295]
## [10.00338788 0. 8.22975924 ... 15.01864768 15.07113502
## 17.80772563]
## [ 5.0584905 8.22975924 0. ... 11.07118093 10.11706575
## 13.91019089]
## ...
## [ 7.00964061 15.01864768 11.07118093 ... 0. 3.44858122
## 5.42125048]
## [ 6.148487 15.07113502 10.11706575 ... 3.44858122 0.
## 6.04267766]
## [ 9.84428295 17.80772563 13.91019089 ... 5.42125048 6.04267766
## 0. ]]
We compute the Riemannian vector difference and UMAP-induced distances to quantify local geometrical deviation between samples.
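To make the two quantities concrete, here is a minimal NumPy sketch. It assumes that the Riemannian vector difference between two samples is their ordinary difference scaled by the matching rho entry, and that the distance is the Euclidean norm of that scaled difference. Both are assumptions for illustration; the function names are hypothetical and the package may compute these quantities differently.

```python
import numpy as np

def scaled_differences(X, rho):
    """Pairwise differences x_i - x_j, each scaled by rho[i, j]; shape (n, n, d)."""
    diffs = X[:, None, :] - X[None, :, :]
    return rho[:, :, None] * diffs

def distance_matrix(diffs):
    """Euclidean norm of every pairwise difference vector; shape (n, n)."""
    return np.linalg.norm(diffs, axis=2)

# Three made-up 2-D points and a symmetric rho matrix.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
rho = np.array([[1.0, 1.0, 0.5],
                [1.0, 1.0, 1.0],
                [0.5, 1.0, 1.0]])

diffs = scaled_differences(X, rho)
D = distance_matrix(diffs)
print(D)
```

With a symmetric rho, the difference tensor is antisymmetric (the block for the pair (i, j) is the negative of the block for (j, i)) and the distance matrix is symmetric with a zero diagonal, which matches the pattern visible in the printed outputs.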
riemann_corr = analysis.riemannian_correlation_matrix()
print("Riemannian Correlation Matrix:\n", riemann_corr)
## Riemannian Correlation Matrix:
## [[ 1. 0.45273855 0.13331328 -0.04184126 0.09649943 0.20460164
## 0.12250863 0.05726846 0.06236543 0.11616806]
## [ 0.45273855 1. -0.10806913 -0.02340368 -0.08123159 -0.00412076
## -0.00695881 -0.01536835 -0.03696684 0.02381021]
## [ 0.13331328 -0.10806913 1. 0.07218516 0.31681631 0.48321739
## 0.16509523 0.21368007 0.24509065 0.44255394]
## [-0.04184126 -0.02340368 0.07218516 1. 0.0236562 0.05200082
## 0.04823302 0.04817491 0.101584 0.04828608]
## [ 0.09649943 -0.08123159 0.31681631 0.0236562 1. 0.49624761
## 0.13871006 0.17433483 0.41503529 0.35090023]
## [ 0.20460164 -0.00412076 0.48321739 0.05200082 0.49624761 1.
## 0.11447819 0.28244363 0.36515385 0.60996974]
## [ 0.12250863 -0.00695881 0.16509523 0.04823302 0.13871006 0.11447819
## 1. 0.11132107 0.11652943 0.11246946]
## [ 0.05726846 -0.01536835 0.21368007 0.04817491 0.17433483 0.28244363
## 0.11132107 1. 0.14385986 0.26334555]
## [ 0.06236543 -0.03696684 0.24509065 0.101584 0.41503529 0.36515385
## 0.11652943 0.14385986 1. 0.36130211]
## [ 0.11616806 0.02381021 0.44255394 0.04828608 0.35090023 0.60996974
## 0.11246946 0.26334555 0.36130211 1. ]]
riemann_components = analysis.riemannian_components_from_data_and_correlation(riemann_corr)
print("Riemannian Principal Components:\n", riemann_components)
## Riemannian Principal Components:
## [[ 0. 0. 0. ... 0. 0.
## 0. ]
## [-1.12000718 -1.1488369 1.38310433 ... 0.68307103 -0.54132528
## -0.65762422]
## [-2.09432963 -0.37825639 -1.01443784 ... -0.53233017 -0.54211393
## 0.26006884]
## ...
## [-1.74625799 0.82729714 0.57429004 ... 0.49121132 1.17291308
## 0.15839634]
## [-1.73914003 0.88846505 -0.87851719 ... 0.86572755 0.64447996
## 1.08567023]
## [-2.12619146 1.46498404 1.60965354 ... -0.48458834 -0.30057755
## -0.95568832]]
Principal components are derived from the Riemannian correlation matrix, capturing variance in the intrinsic geometry of the data.
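For readers who want to see the mechanics, the sketch below shows the classical version of this step: an eigendecomposition of a correlation matrix followed by a projection of the standardized data. It uses an ordinary Pearson correlation matrix and random data, so it illustrates the general technique rather than the package's Riemannian computation; components_from_correlation is a hypothetical name.

```python
import numpy as np

def components_from_correlation(Z, corr):
    """Project standardized data Z onto the eigenvectors of corr,
    ordered by decreasing eigenvalue."""
    eigvals, eigvecs = np.linalg.eigh(corr)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]         # reorder to descending
    return Z @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # made-up data
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each column
corr = np.corrcoef(Z, rowvar=False)

scores, eigvals = components_from_correlation(Z, corr)
print(scores.shape, eigvals)
```

In this classical setting the variance of each score column equals the corresponding eigenvalue, and the eigenvalues sum to the number of variables.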
comp1, comp2 = 0, 1
inertia = utilities.pca_inertia_by_components(riemann_corr, comp1, comp2) * 100
print(f"Explained Inertia (PC1 & PC2): {inertia:.2f}%")
## Explained Inertia (PC1 & PC2): 43.61%
The explained inertia quantifies the proportion of Riemannian variance captured by the first two components.
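A common way to obtain such a figure is to divide the eigenvalues of the two selected components by the sum of all eigenvalues of the correlation matrix. The sketch below assumes that definition; inertia_by_components is a hypothetical stand-in written for illustration and is not claimed to reproduce utilities.pca_inertia_by_components.

```python
import numpy as np

def inertia_by_components(corr, comp1, comp2):
    """Share of total inertia carried by two components (0-based indices),
    with eigenvalues sorted in decreasing order."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]
    return (eigvals[comp1] + eigvals[comp2]) / eigvals.sum()

# Small made-up correlation matrix with eigenvalues 1.5, 1.0 and 0.5.
corr = np.array([[1.0, 0.5, 0.0],
                 [0.5, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])

share = inertia_by_components(corr, 0, 1)
print(f"{share * 100:.2f}%")  # 83.33%
```

Here the first two eigenvalues account for 2.5 of a total of 3, hence 83.33%.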
correlations = analysis.riemannian_correlation_variables_components(riemann_components)
print("Correlations Variables vs Components:\n", correlations)
## Correlations Variables vs Components:
## Component_1 Component_2
## feature_1 -0.266064 -0.814317
## feature_2 0.017446 -0.854837
## feature_3 -0.674133 0.091343
## feature_4 -0.12044 0.153522
## feature_5 -0.677698 0.115575
## feature_6 -0.824299 -0.023679
## feature_7 -0.283797 -0.088618
## feature_8 -0.437629 0.03141
## feature_9 -0.609039 0.116612
## feature_10 -0.755617 0.002371
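We compute how strongly each original variable is correlated with the Riemannian principal components. In the table above, feature_1 and feature_2 are tied mainly to Component_2 (correlations of about -0.81 and -0.85), while feature_3, feature_5, feature_6, feature_9, and feature_10 are tied mainly to Component_1 (magnitudes between about 0.61 and 0.82).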
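In classical PCA on standardized data, the correlation between variable j and component k equals the j-th entry of the k-th eigenvector multiplied by the square root of the k-th eigenvalue. The sketch below verifies this identity numerically on made-up data using ordinary Pearson correlations. It is meant as background for reading such a table, under the assumption that the Riemannian version follows the same logic with its own correlation measure.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))                 # made-up data
X[:, 1] += 0.8 * X[:, 0]                      # make two variables correlated
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize each column

corr = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]             # descending eigenvalues
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs

# Empirical correlations between the 4 variables and the first 2 components.
full = np.corrcoef(np.hstack([Z, scores]), rowvar=False)
empirical = full[:4, 4:6]

# Closed form: eigenvector entries scaled by the square root of the eigenvalue.
loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])
print(np.allclose(empirical, loadings))
```

Because the squared correlations of a variable across all components sum to 1, its coordinates on any two components always fall inside the unit circle, which is why such a plot is called a correlation circle.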
if clusters is not None:
    viz = visualization(data=data_with_clusters,
                        components=riemann_components,
                        explained_inertia=inertia,
                        clusters=clusters)
    viz.plot_2d_scatter_with_clusters(x_col="x", y_col="y", cluster_col="cluster", title="Data10D_250.csv")
    viz.plot_principal_plane_with_clusters(title="Data10D_250.csv")
    viz.plot_3d_scatter_with_clusters(x_col="x", y_col="y", z_col="var1", cluster_col="cluster",
                                      title="Data10D_250.csv", figsize=(12, 8))
else:
    viz = visualization(data=data,
                        components=riemann_components,
                        explained_inertia=inertia)
    viz.plot_principal_plane(title="Data10D_250.csv")
viz.plot_correlation_circle(correlations=correlations, title="Data10D_250.csv")
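We visualize the structure of the dataset via a 2D scatter plot with clusters, the principal plane, a 3D scatter plot with clusters, and the correlation circle.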
This analysis demonstrates how the Riemannian STATS package enables manifold-aware analysis of complex high-dimensional datasets. Using UMAP to induce a Riemannian structure on the data preserves local structure and yields meaningful low-dimensional representations for clustering, interpretation, and visualization.