Riemannian Principal Component Analysis

This document demonstrates how to use the riemannian_stats package to perform Riemannian Principal Component Analysis (R-PCA) on a synthetic high-dimensional dataset (Data10D_250.csv). R-PCA extends standard PCA by taking the local geometry of the data into account through a Riemannian manifold structure induced via UMAP.

We showcase the full pipeline: loading and preprocessing the dataset, computing manifold-based metrics, extracting Riemannian principal components, and visualizing both the structure and correlation of the data.


📦 Load Required Modules

from riemannian_stats import riemannian_analysis, visualization, data_processing, utilities

📄 Load and Prepare Data

data = data_processing.load_data("./data/Data10D_250.csv", separator=",", decimal=".")
n_neighbors = int(len(data) / 5)

if 'cluster' in data.columns:
    clusters = data['cluster']
    data_with_clusters = data.copy()
    # Drop the label column by name so the result does not depend on column order.
    data = data.drop(columns=['cluster'])
else:
    clusters = None
    data_with_clusters = data

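We load the comma-separated, high-dimensional dataset. If a cluster column is present, its labels are saved separately and the column is dropped, so that only the numeric features enter the statistical analysis. A copy that still holds the labels is kept for plotting.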


🔍 Create Analysis Instance

analysis = riemannian_analysis(data, n_neighbors=n_neighbors)

We initialize the Riemannian analysis instance with n_neighbors = int(len(data) / 5), so each neighborhood spans roughly one fifth of the samples. The divisor is chosen from the number of clusters expected in the dataset.
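
The neighborhood-size rule can be wrapped in a small helper. This is a minimal sketch and not part of riemannian_stats: the name `neighbors_from_groups` and its clamping bounds are assumptions made for illustration.

```python
def neighbors_from_groups(n_samples: int, n_groups: int = 5) -> int:
    """Hypothetical helper: size each neighborhood as one expected group.

    Mirrors int(len(data) / 5) from the walkthrough, then clamps the result
    so that it stays between 2 and n_samples - 1.
    """
    if n_samples < 3:
        raise ValueError("need at least 3 samples to define a neighborhood")
    if n_groups < 1:
        raise ValueError("n_groups must be positive")
    k = n_samples // n_groups
    return max(2, min(k, n_samples - 1))


print(neighbors_from_groups(250))  # 250 // 5
```

The clamp only matters for very small datasets or a very small group count, where the plain division would give a neighborhood that is too small or as large as the whole dataset.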


📈 UMAP Similarities and Rho Matrix

umap_similarities = analysis.umap_similarities
print("UMAP Similarities Matrix:\n", umap_similarities)
## UMAP Similarities Matrix:
##  [[0.         0.         0.10272729 ... 0.         0.04154779 0.        ]
##  [0.         0.         0.         ... 0.         0.         0.        ]
##  [0.10272729 0.         0.         ... 0.         0.         0.        ]
##  ...
##  [0.         0.         0.         ... 0.         0.11827033 0.        ]
##  [0.04154779 0.         0.         ... 0.11827033 0.         0.        ]
##  [0.         0.         0.         ... 0.         0.         0.        ]]
rho = analysis.rho
print("Rho Matrix:\n", rho)
## Rho Matrix:
##  [[1.         1.         0.8972727  ... 1.         0.9584522  1.        ]
##  [1.         1.         1.         ... 1.         1.         1.        ]
##  [0.8972727  1.         1.         ... 1.         1.         1.        ]
##  ...
##  [1.         1.         1.         ... 1.         0.88172966 1.        ]
##  [0.9584522  1.         1.         ... 0.88172966 1.         1.        ]
##  [1.         1.         1.         ... 1.         1.         1.        ]]

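UMAP similarities define the local neighborhood structure: an entry is nonzero only for pairs of points that UMAP treats as neighbors. In the printed output, each rho entry equals one minus the corresponding similarity (for example, 1 - 0.10272729 = 0.8972727), so pairs that are not neighbors keep a weight of 1. The rho matrix is the foundation of the Riemannian metric used in the following steps.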
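
The relation between the two printed matrices can be checked on a few of their entries. The sketch below assumes `rho = 1 - similarities`, which is inferred from the output and not confirmed as the package's definition; `rho_from_similarities` is a hypothetical name.

```python
import numpy as np


def rho_from_similarities(similarities: np.ndarray) -> np.ndarray:
    """Assumed relation, inferred from the printed output: rho = 1 - similarity."""
    similarities = np.asarray(similarities, dtype=float)
    return 1.0 - similarities


# Three-point toy matrix built from entries that appear in the printed output.
sims = np.array([
    [0.0, 0.10272729, 0.04154779],
    [0.10272729, 0.0, 0.0],
    [0.04154779, 0.0, 0.0],
])
rho = rho_from_similarities(sims)
print(rho)
```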


🔁 Vector Differences and UMAP-Based Distances

riemannian_diff = analysis.riemannian_diff
print("Riemannian Vector Differences:\n", riemannian_diff)
## Riemannian Vector Differences:
##  [[[  0.           0.           0.         ...   0.           0.
##      0.        ]
##   [  0.49167881  -9.35504083   0.73083244 ...  -1.69081857  -0.53916487
##     -2.42982678]
##   [ -2.45635885  -3.06323063  -2.23033423 ...  -0.70641431  -0.58870373
##     -1.06245603]
##   ...
##   [  2.12384659   5.18906417   1.58827847 ...  -0.18652492  -2.55233115
##     -1.8777029 ]
##   [  2.13164278   4.69335938  -0.11454679 ...  -0.58611284  -0.23922768
##     -0.5809582 ]
##   [  4.34901569   7.87640137  -1.18914482 ...  -2.04384362  -1.12418008
##     -1.99329798]]
## 
##  [[ -0.49167881   9.35504083  -0.73083244 ...   1.69081857   0.53916487
##      2.42982678]
##   [  0.           0.           0.         ...   0.           0.
##      0.        ]
##   [ -3.22926219   5.9411059   -3.21651401 ...   0.90352803  -0.11693859
##      1.24573189]
##   ...
##   [  1.63216777  14.54410499   0.85744603 ...   1.50429365  -2.01316628
##      0.55212388]
##   [  1.73236817  14.25185181  -0.8503447  ...   1.07929843   0.28956696
##      1.82368473]
##   [  3.85733687  17.2314422   -1.91997726 ...  -0.35302505  -0.58501521
##      0.43652879]]
## 
##  [[  2.45635885   3.06323063   2.23033423 ...   0.70641431   0.58870373
##      1.06245603]
##   [  3.22926219  -5.9411059    3.21651401 ...  -0.90352803   0.11693859
##     -1.24573189]
##   [  0.           0.           0.         ...   0.           0.
##      0.        ]
##   ...
##   [  4.86142997   8.6029991    4.07396005 ...   0.60076562  -1.89622769
##     -0.69360801]
##   [  4.96163037   8.31074591   2.36616931 ...   0.1757704    0.40650555
##      0.57795284]
##   [  7.08659907  11.29033631   1.29653675 ...  -1.25655308  -0.46807661
##     -0.80920309]]
## 
##  ...
## 
##  [[ -2.12384659  -5.18906417  -1.58827847 ...   0.18652492   2.55233115
##      1.8777029 ]
##   [ -1.63216777 -14.54410499  -0.85744603 ...  -1.50429365   2.01316628
##     -0.55212388]
##   [ -4.86142997  -8.6029991   -4.07396005 ...  -0.60076562   1.89622769
##      0.69360801]
##   ...
##   [  0.           0.           0.         ...   0.           0.
##      0.        ]
##   [  0.08834967  -0.2576883   -1.50580975 ...  -0.37473089   2.0303882
##      1.12117292]
##   [  2.2251691    2.68733721  -2.7774233  ...  -1.8573187    1.42815107
##     -0.11559509]]
## 
##  [[ -2.13164278  -4.69335938   0.11454679 ...   0.58611284   0.23922768
##      0.5809582 ]
##   [ -1.73236817 -14.25185181   0.8503447  ...  -1.07929843  -0.28956696
##     -1.82368473]
##   [ -4.96163037  -8.31074591  -2.36616931 ...  -0.1757704   -0.40650555
##     -0.57795284]
##   ...
##   [ -0.08834967   0.2576883    1.50580975 ...   0.37473089  -2.0303882
##     -1.12117292]
##   [  0.           0.           0.         ...   0.           0.
##      0.        ]
##   [  2.1249687    2.9795904   -1.06963256 ...  -1.43232348  -0.87458216
##     -1.38715593]]
## 
##  [[ -4.34901569  -7.87640137   1.18914482 ...   2.04384362   1.12418008
##      1.99329798]
##   [ -3.85733687 -17.2314422    1.91997726 ...   0.35302505   0.58501521
##     -0.43652879]
##   [ -7.08659907 -11.29033631  -1.29653675 ...   1.25655308   0.46807661
##      0.80920309]
##   ...
##   [ -2.2251691   -2.68733721   2.7774233  ...   1.8573187   -1.42815107
##      0.11559509]
##   [ -2.1249687   -2.9795904    1.06963256 ...   1.43232348   0.87458216
##      1.38715593]
##   [  0.           0.           0.         ...   0.           0.
##      0.        ]]]
umap_distance_matrix = analysis.umap_distance_matrix
print("UMAP Distance Matrix:\n", umap_distance_matrix)
## UMAP Distance Matrix:
##  [[ 0.         10.00338788  5.0584905  ...  7.00964061  6.148487
##    9.84428295]
##  [10.00338788  0.          8.22975924 ... 15.01864768 15.07113502
##   17.80772563]
##  [ 5.0584905   8.22975924  0.         ... 11.07118093 10.11706575
##   13.91019089]
##  ...
##  [ 7.00964061 15.01864768 11.07118093 ...  0.          3.44858122
##    5.42125048]
##  [ 6.148487   15.07113502 10.11706575 ...  3.44858122  0.
##    6.04267766]
##  [ 9.84428295 17.80772563 13.91019089 ...  5.42125048  6.04267766
##    0.        ]]

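We retrieve the Riemannian vector differences, a three-dimensional array holding one difference vector per ordered pair of samples, and the UMAP-induced distance matrix. As the output shows, the differences are antisymmetric (the entry for the pair (j, i) is the negative of the entry for (i, j)), and the distance matrix is symmetric with a zero diagonal. Together they quantify the local geometric deviation between samples.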
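
A minimal NumPy sketch of how such arrays can be built is shown here. It assumes that each pairwise difference is scaled by the matching rho weight and that the distance is the Euclidean norm of the scaled difference; this matches the shape and symmetry of the printed output but is not confirmed as the package's implementation. The function names are hypothetical.

```python
import numpy as np


def weighted_differences(data: np.ndarray, rho: np.ndarray) -> np.ndarray:
    """Return an (n, n, d) array whose [i, j] entry is rho[i, j] * (x_i - x_j)."""
    diffs = data[:, None, :] - data[None, :, :]   # plain pairwise differences
    return rho[:, :, None] * diffs                # scale each pair by its weight


def distances_from_differences(diffs: np.ndarray) -> np.ndarray:
    """Euclidean norm of each difference vector, giving an (n, n) matrix."""
    return np.linalg.norm(diffs, axis=2)


points = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
rho = np.array([[1.0, 0.5, 1.0],
                [0.5, 1.0, 1.0],
                [1.0, 1.0, 1.0]])
diffs = weighted_differences(points, rho)
dist = distances_from_differences(diffs)
print(dist)
```

In this toy case the first two points are neighbors (weight 0.5), so their distance shrinks from 5 to 2.5, while the other pairs keep their Euclidean distance.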


📊 Correlation Matrix and Principal Components

riemann_corr = analysis.riemannian_correlation_matrix()
print("Riemannian Correlation Matrix:\n", riemann_corr)
## Riemannian Correlation Matrix:
##  [[ 1.          0.45273855  0.13331328 -0.04184126  0.09649943  0.20460164
##    0.12250863  0.05726846  0.06236543  0.11616806]
##  [ 0.45273855  1.         -0.10806913 -0.02340368 -0.08123159 -0.00412076
##   -0.00695881 -0.01536835 -0.03696684  0.02381021]
##  [ 0.13331328 -0.10806913  1.          0.07218516  0.31681631  0.48321739
##    0.16509523  0.21368007  0.24509065  0.44255394]
##  [-0.04184126 -0.02340368  0.07218516  1.          0.0236562   0.05200082
##    0.04823302  0.04817491  0.101584    0.04828608]
##  [ 0.09649943 -0.08123159  0.31681631  0.0236562   1.          0.49624761
##    0.13871006  0.17433483  0.41503529  0.35090023]
##  [ 0.20460164 -0.00412076  0.48321739  0.05200082  0.49624761  1.
##    0.11447819  0.28244363  0.36515385  0.60996974]
##  [ 0.12250863 -0.00695881  0.16509523  0.04823302  0.13871006  0.11447819
##    1.          0.11132107  0.11652943  0.11246946]
##  [ 0.05726846 -0.01536835  0.21368007  0.04817491  0.17433483  0.28244363
##    0.11132107  1.          0.14385986  0.26334555]
##  [ 0.06236543 -0.03696684  0.24509065  0.101584    0.41503529  0.36515385
##    0.11652943  0.14385986  1.          0.36130211]
##  [ 0.11616806  0.02381021  0.44255394  0.04828608  0.35090023  0.60996974
##    0.11246946  0.26334555  0.36130211  1.        ]]
riemann_components = analysis.riemannian_components_from_data_and_correlation(riemann_corr)
print("Riemannian Principal Components:\n", riemann_components)
## Riemannian Principal Components:
##  [[ 0.          0.          0.         ...  0.          0.
##    0.        ]
##  [-1.12000718 -1.1488369   1.38310433 ...  0.68307103 -0.54132528
##   -0.65762422]
##  [-2.09432963 -0.37825639 -1.01443784 ... -0.53233017 -0.54211393
##    0.26006884]
##  ...
##  [-1.74625799  0.82729714  0.57429004 ...  0.49121132  1.17291308
##    0.15839634]
##  [-1.73914003  0.88846505 -0.87851719 ...  0.86572755  0.64447996
##    1.08567023]
##  [-2.12619146  1.46498404  1.60965354 ... -0.48458834 -0.30057755
##   -0.95568832]]

Principal components are derived from the Riemannian correlation matrix, capturing variance in the intrinsic geometry of the data.
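
For orientation, the classical counterpart of this step can be sketched with NumPy: eigen-decompose a correlation matrix and project the standardized data onto its eigenvectors. The sketch uses the ordinary Pearson correlation and ordinary centering as stand-ins; the package's Riemannian weighting is not reproduced, and `components_from_correlation` is a hypothetical name.

```python
import numpy as np


def components_from_correlation(data: np.ndarray, corr: np.ndarray) -> np.ndarray:
    """Project standardized data onto the eigenvectors of a correlation matrix."""
    # eigh suits symmetric matrices; it returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]          # largest eigenvalue first
    eigenvectors = eigenvectors[:, order]
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    return standardized @ eigenvectors


rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3))
toy[:, 1] += 0.8 * toy[:, 0]                       # make two columns correlated
corr = np.corrcoef(toy, rowvar=False)
scores = components_from_correlation(toy, corr)
print(scores.shape)
```

Each column of `scores` is one component, ordered by decreasing variance, which is the same layout as the component matrix printed in this section.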


🧮 Explained Inertia

comp1, comp2 = 0, 1
inertia = utilities.pca_inertia_by_components(riemann_corr, comp1, comp2) * 100
print(f"Explained Inertia (PC1 & PC2): {inertia:.2f}%")
## Explained Inertia (PC1 & PC2): 43.61%

The explained inertia quantifies the proportion of Riemannian variance captured by the selected components. Here the first two components (indices 0 and 1) account for 43.61% of the total.
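
In classical PCA this share is the ratio of the selected eigenvalues of the correlation matrix to the sum of all eigenvalues. The sketch below implements that definition on a small matrix; whether `utilities.pca_inertia_by_components` computes exactly this is an assumption, and `inertia_of_components` is a hypothetical name.

```python
import numpy as np


def inertia_of_components(corr: np.ndarray, comp1: int, comp2: int) -> float:
    """Share of total inertia carried by two components.

    comp1 and comp2 index the eigenvalues after sorting them in descending order.
    """
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # descending order
    return float((eigenvalues[comp1] + eigenvalues[comp2]) / eigenvalues.sum())


# Eigenvalues of this matrix are 1.6, 1.0 and 0.4.
corr = np.array([[1.0, 0.6, 0.0],
                 [0.6, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
share = inertia_of_components(corr, 0, 1) * 100
print(f"{share:.2f}%")  # (1.6 + 1.0) / 3.0 -> 86.67%
```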


🔗 Correlation Between Variables and Components

correlations = analysis.riemannian_correlation_variables_components(riemann_components)
print("Correlations Variables vs Components:\n", correlations)
## Correlations Variables vs Components:
##             Component_1 Component_2
## feature_1    -0.266064   -0.814317
## feature_2     0.017446   -0.854837
## feature_3    -0.674133    0.091343
## feature_4     -0.12044    0.153522
## feature_5    -0.677698    0.115575
## feature_6    -0.824299   -0.023679
## feature_7    -0.283797   -0.088618
## feature_8    -0.437629     0.03141
## feature_9    -0.609039    0.116612
## feature_10   -0.755617    0.002371

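We compute how strongly each original variable is correlated with the first two Riemannian principal components. In the output, feature_3, feature_5, feature_6, feature_9, and feature_10 correlate mainly with Component_1 (between about -0.61 and -0.82), while feature_1 and feature_2 correlate mainly with Component_2 (about -0.81 and -0.85). feature_4 is only weakly correlated with either component.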
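
The shape of such a table can be illustrated with ordinary Pearson correlations. This is a sketch under that simplification: the package computes a Riemannian version, which is not reproduced here, and `variable_component_correlations` is a hypothetical name.

```python
import numpy as np


def variable_component_correlations(data: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Pearson correlation of every variable (rows) with every component (columns)."""
    n_vars = data.shape[1]
    stacked = np.hstack([data, scores])            # variables first, then components
    full = np.corrcoef(stacked, rowvar=False)      # square matrix over all columns
    return full[:n_vars, n_vars:]                  # keep the variable-vs-component block


rng = np.random.default_rng(1)
data = rng.normal(size=(100, 3))
scores = np.column_stack([data[:, 0], -data[:, 1]])  # toy "components"
table = variable_component_correlations(data, scores)
print(np.round(table, 2))
```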


📊 Visualizations

if clusters is not None:
    viz = visualization(data=data_with_clusters,
                        components=riemann_components,
                        explained_inertia=inertia,
                        clusters=clusters)
    viz.plot_2d_scatter_with_clusters(x_col="x", y_col="y", cluster_col="cluster", title="Data10D_250.csv")
    viz.plot_principal_plane_with_clusters(title="Data10D_250.csv")
    viz.plot_3d_scatter_with_clusters(x_col="x", y_col="y", z_col="var1", cluster_col="cluster",
                                      title="Data10D_250.csv", figsize=(12, 8))
else:
    viz = visualization(data=data,
                        components=riemann_components,
                        explained_inertia=inertia)
    viz.plot_principal_plane(title="Data10D_250.csv")

viz.plot_correlation_circle(correlations=correlations, title="Data10D_250.csv")

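When cluster labels are available, we visualize the structure of the dataset via:

  • A 2D scatter plot of two original variables with cluster coloring
  • The principal plane projection with cluster coloring
  • A 3D scatter plot of three original variables with cluster coloring
  • A correlation circle of variables vs components

Without cluster labels, only the principal plane and the correlation circle are drawn.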
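
One common way to read a correlation circle is by distance from the origin: a variable whose point lies near the unit circle is well represented on the principal plane, while one near the origin is not. The sketch below computes the squared distance for a few rows of the correlation table printed earlier. `squared_radius` is a hypothetical helper, not a package function, and applying the classical reading to the Riemannian correlations is an assumption.

```python
# Correlations with Component_1 and Component_2, copied from the printed table.
correlations = {
    "feature_1": (-0.266064, -0.814317),
    "feature_2": (0.017446, -0.854837),
    "feature_4": (-0.12044, 0.153522),
    "feature_6": (-0.824299, -0.023679),
}


def squared_radius(coords):
    """Squared distance of a variable from the origin of the correlation circle."""
    c1, c2 = coords
    return c1 ** 2 + c2 ** 2


quality = {name: squared_radius(xy) for name, xy in correlations.items()}
# List the variables from best to worst represented.
for name, value in sorted(quality.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
```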

✅ Summary

This analysis demonstrates how the riemannian_stats package enables manifold-aware analysis of complex high-dimensional datasets. By weighting the data with a Riemannian structure derived from UMAP, R-PCA preserves local neighborhood structure and yields low-dimensional projections that support cluster inspection, interpretation, and visualization.