3.1.5. Calculating path similarity — MDAnalysis.analysis.psa

Author:Sean Seyler
Year:2015
Copyright:GNU Public License v3

New in version 0.10.0.

The module contains code to calculate the geometric similarity of trajectories using path metrics such as the Hausdorff or Fréchet distances [Seyler2015]. The path metrics are functions of two paths and return a nonnegative number, i.e., a distance. Two paths are identical if their distance is zero, and large distances indicate dissimilarity. Each path metric is a function of the individual points (e.g., coordinate snapshots) that comprise each path and, loosely speaking, identify the two points, one per path of a pair of paths, where the paths deviate the most. The distance between these points of maximal deviation is measured by the root mean square deviation (RMSD), i.e., to compute structural similarity.

One typically computes the pairwise similarity for an ensemble of paths to produce a symmetric distance matrix, which can be clustered to, at a glance, identify patterns in the trajectory data. To properly analyze a path ensemble, one must select a suitable reference structure to which all paths (each conformer in each path) will be universally aligned using the rotations determined by the best-fit rmsds. Distances between paths and their structures are then computed directly with no further alignment. This pre-processing step is necessary to preserve the metric properties of the Hausdorff and Fréchet metrics; using the best-fit rmsd on a pairwise basis does not generally preserve the triangle inequality.

See also

The PSAnalysisTutorial outlines a typical application of PSA to a set of trajectories, including doing proper alignment, performing distance comparisons, and generating heat map-dendrogram plots from hierarchical clustering.

References

[Seyler2015]Sean L. Seyler, Avishek Kumar, Michael F. Thorpe, Oliver Beckstein. Path Similarity Analysis: a Method for Quantifying Macromolecular Pathways. arXiv:1505.04807 (2015).

3.1.5.1. Helper functions and variables

The following global variables are used by the functions and classes in this module.

MDAnalysis.analysis.psa.hausdorff_names
MDAnalysis.analysis.psa.frechet_names

The following functions are used by the other functions in this module.

MDAnalysis.analysis.psa.sqnorm(v, axis=None)[source]

Compute the sum of squares of elements along specified axes.

Arguments:
v

numpy.ndarray of coordinates

axis

None or int or tuple of ints, optional Axis or axes along which a sum is performed. The default (axis = None) is perform a sum over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.

Returns:

float, the sum of the squares of the elements of v along axis

3.1.5.2. Classes, methods, and functions

MDAnalysis.analysis.psa.get_path_metric_func(name)[source]

Selects a path metric function by name.

Arguments:
name

string, name of path metric

Returns:

The path metric function specified by name (if found).

MDAnalysis.analysis.psa.hausdorff(P, Q, N=None)[source]

Calculate the Hausdorff distance between two paths.

P (Q) is a numpy.ndarray of \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g., MDAnalysis.core.AtomGroup.AtomGroup.coordinates()). P (Q) has either shape \(N_p\times N\times 3\) (\(N_q\times N\times 3\)), or \(N_p\times(3N)\) (\(N_q\times(3N)\)) in flattened form.

Arguments:
P

numpy.ndarray representing a path

Q

numpy.ndarray representing a path

N

int, number of atoms; if None then P and Q are each assumed to be a 3D numpy.ndarray; else, they are 2D (flattened)

Returns:

float, the Hausdorff distance between paths P and Q

Example::
>>> u = Universe(PSF,DCD)
>>> mid = len(u.trajectory)/2
>>> ca = u.selectAtoms('name CA')
>>> P = numpy.array([
...                ca.positions for _ in u.trajectory[:mid:]
...              ]) # first half of trajectory
>>> Q = numpy.array([
...                ca.positions for _ in u.trajectory[mid::]
...              ]) # second half of trajectory
>>> hausdorff(P,Q)
4.7786639840135905
>>> hausdorff(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory
4.7786639840135905
MDAnalysis.analysis.psa.discrete_frechet(P, Q, N=None)[source]

Calculate the discrete Frechet distance between two paths.

P (Q) is a numpy.ndarray of \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g., MDAnalysis.core.AtomGroup.AtomGroup.coordinates()). P (Q) has either shape \(N_p\times N\times 3\) (\(N_q\times N\times 3\)), or :\(N_p\times(3N)\) (\(N_q\times(3N)\)) in flattened form.

Arguments:
P

numpy.ndarray representing a path

Q

numpy.ndarray representing a path

N

int, number of atoms; if None then P and Q are each assumed to be a 3D numpy.ndarray; else, they are 2D (flattened)

Returns:

float, the discrete Frechet distance between paths P and Q

Example::
>>> u = Universe(PSF,DCD)
>>> mid = len(u.trajectory)/2
>>> ca = u.selectAtoms('name CA')
>>> P = numpy.array([
...                ca.positions for _ in u.trajectory[:mid:]
...              ]) # first half of trajectory
>>> Q = numpy.array([
...                ca.positions for _ in u.trajectory[mid::]
...              ]) # second half of trajectory
>>> discrete_frechet(P,Q)
4.7786639840135905
>>> discrete_frechet(P,Q[::-1]) # frechet distance w/ 2nd trj reversed 2nd
6.8429011177113832
class MDAnalysis.analysis.psa.Path(universe, reference, ref_select='name CA', path_select='all', ref_frame=0)[source]

Pre-process a MDAnalysis.Universe object: (1) fit the trajectory to a reference structure, (2) convert fitted time series to a numpy.ndarray representation of Path.path.

The analysis is performed with PSA.run() and stores the result in the numpy.ndarray distance matrix PSA.D. PSA.run() also generates a fitted trajectory and path from alignment of the original trajectories to a reference structure.

New in version 0.9.1.

Setting up trajectory alignment and fitted path generation.

Arguments:
universe

MDAnalysis.Universe object containing a trajectory

reference

reference structure; MDAnalysis.Universe object; if None then traj is used (uses the current time step of the object) [None]

ref_select

The selection to operate on for rms fitting; can be one of:

  1. any valid selection string for selectAtoms() that produces identical selections in mobile and reference; or
  2. a dictionary {'mobile':sel1, 'reference':sel2} (the MDAnalysis.analysis.align.fasta2select() function returns such a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or
  3. a tuple (sel1, sel2)

When using 2. or 3. with sel1 and sel2 then these selections can also each be a list of selection strings (to generate an AtomGroup with defined atom order as described under Ordered selections).

path_select

atom selection composing coordinates of (fitted) path; if None then path_select is set to ref_select [None]

u_original

MDAnalysis.Universe object with a trajectory

u_reference

MDAnalysis.Universe object containing a reference structure

ref_select

string, selection for selectAtoms() to select frame from Path.u_reference

path_select

string, selection for selectAtoms() to select atoms to compose Path.path

ref_frame

int, frame index to select frame from Path.u_reference

u_fitted

MDAnalysis.Universe object with the fitted trajectory

path

numpy.ndarray object representation of the fitted trajectory

fit_to_reference(filename=None, prefix='', postfix='_fit', rmsdfile=None, targetdir='.', mass_weighted=False, tol_mass=0.1)[source]

Align each trajectory frame to the reference structure with MDAnalysis.analysis.align.rms_fit_trj().

Arguments:
filename

file name for the RMS-fitted trajectory or pdb; defaults to the original trajectory filename (from Path.u_original) with prefix prepended

prefix

prefix for auto-generating the new output filename

rmsdfile

file name for writing the RMSD time series [None]

mass_weighted

do a mass-weighted RMSD fit

tol_mass

Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]

Returns:

MDAnalysis.Universe object containing a fitted trajectory

run(align=False, filename=None, postfix='_fit', rmsdfile=None, targetdir='.', mass_weighted=False, tol_mass=0.1, flat=False)[source]

Generate a path from a trajectory and reference structure, aligning to a reference structure if specified.

This is a convenience method to generate a fitted trajectory from an inputted universe (Path.u_original) and reference structure (Path.u_reference). Path.fit_to_reference() and Path.to_path() are used consecutively to generate a new universe (Path.u_fitted) containing the fitted trajectory along with the corresponding Path.path represented as an numpy.ndarray. The method returns a tuple of the topology name and new trajectory name, which can be fed directly into an MDAnalysis.Universe object after unpacking the tuple using the * operator, as in MDAnalysis.Universe(*(top_name, newtraj_name)).

Arguments:
align

Align trajectory to atom selection Path.ref_select of Path.u_reference. If True, a universe containing an aligned trajectory is produced with Path.fit_to_reference() [False]

filename

filename for the RMS-fitted trajectory or pdb; defaults to the original trajectory filename (from Path.u_original) with prefix prepended

prefix

prefix for auto-generating the new output filename

rmsdfile

file name for writing the RMSD time series [None]

mass_weighted

do a mass-weighted RMSD fit

tol_mass

Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]

flat

represent Path.path with a 2D (\(N_p\times 3N\)) numpy.ndarray; if False then Path.path is a 3D (\(N_p\times N\times 3\)) numpy.ndarray [False]

Returns:

A tuple of the topology name and new trajectory name.

to_path(fitted=False, select=None, flat=False)[source]

Generates a coordinate time series from the fitted universe trajectory.

Given a selection of N atoms from select, the atomic positions for each frame in the fitted universe (Path.u_fitted) trajectory (with \(N_p\) total frames) are appended sequentially to form a 3D or 2D (if flat is True) numpy.ndarray representation of the fitted trajectory (with dimensions \(N_p\times N\times 3\) or \(N_p\times 3N\), respectively).

Arguments:
fitted

construct a Path.path from the Path.u_fitted trajectory; if False then Path.path is generated with the trajectory from Path.u_original [False]

select

the selection for constructing the coordinates of each frame in Path.path; if None then Path.path_select is used, else it is overridden by select [None]

flat

represent Path.path as a 2D (\(N_p\times 3N\)) numpy.ndarray; if False then Path.path is a 3D (\(N_p\times N\times 3\)) numpy.ndarray [False]

Returns:

numpy.ndarray representing a time series of atomic positions of an MDAnalysis.core.AtomGroup.AtomGroup selection from Path.u_fitted.trajectory

class MDAnalysis.analysis.psa.PSA(universes, reference=None, ref_select='name CA', ref_frame=0, path_select=None, labels=None, targetdir=None)[source]

Perform Path Similarity Analysis (PSA) on a set of trajectories.

The analysis is performed with PSA.run() and stores the result in the numpy.ndarray distance matrix PSA.D. PSA.run() also generates a fitted trajectory and path from alignment of the original trajectories to a reference structure.

New in version 0.8.

Setting up Path Similarity Analysis.

The mutual similarity between all unique pairs of trajectories are computed using a selected path metric.

Arguments:
universes

a list of universes (MDAnalysis.Universe object), each containing a trajectory

reference

reference coordinates; MDAnalysis.Universe object; if None the first time step of the first item in trajs is used [None]

ref_select

The selection to operate on; can be one of:

  1. any valid selection string for selectAtoms() that produces identical selections in mobile and reference; or
  2. a dictionary {'mobile':sel1, 'reference':sel2} (the MDAnalysis.analysis.align.fasta2select() function returns such a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or
  3. a tuple (sel1, sel2)

When using 2. or 3. with sel1 and sel2 then these selections can also each be a list of selection strings (to generate an AtomGroup with defined atom order as described under Ordered selections).

mass_weighted

do a mass-weighted RMSD fit [False]

tol_mass

Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]

ref_frame

frame index to select frame from reference [0]

path_select

atom selection composing coordinates of (fitted) path; if None then path_select is set to ref_select [None]

targetdir

output files are saved there [.]

labels

list of strings, names of trajectories to be analyzed (MDAnalysis.Universe); if None, defaults to trajectory names [None]

universes

list of MDAnalysis.Universe objects containing trajectories

u_reference

MDAnalysis.Universe object containing a reference structure

ref_select

string, selection for selectAtoms() to select frame from PSA.u_reference

path_select

string, selection for selectAtoms() to select atoms to compose Path.path

ref_frame

int, frame index to select frame from Path.u_reference

filename

string, name of file to store calculated distance matrix (PSA.D)

paths

list of numpy.ndarray objects representing the set/ensemble of fitted trajectories

D

string, name of file to store calculated distance matrix (PSA.D)

cluster(distArray, method='ward', count_sort=False, distance_sort=False, no_labels=True, color_threshold=4)[source]

Cluster trajectories.

Arguments:
method

string, name of linkage criterion for clustering ['ward']

no_labels

boolean, if True then do not label dendrogram [True]

color_threshold

For brevity, let t be the color_threshold. Colors all the descendent links below a cluster node k the same color if k is the first node below the cut threshold t. All links connecting nodes with distances greater than or equal to the threshold are colored blue. If t is less than or equal to zero, all nodes are colored blue. If color_threshold is None or ‘default’, corresponding with MATLAB(TM) behavior, the threshold is set to 0.7*max(Z[:,2]). [4]]

Returns:

list of indices representing the row-wise order of the objects after clustering

generate_paths(**kwargs)[source]

Generate paths, aligning each to reference structure if necessary.

Keywords:
align

Align trajectories to atom selection PSA.ref_select of PSA.u_reference [False]

filename

strings representing base filename for fitted trajectories and paths [None]

infix

additional tag string that is inserted into the output filename of the fitted trajectory files [‘’]

mass_weighted

do a mass-weighted RMSD fit

tol_mass

Reject match if the atomic masses for matched atoms differ by more than tol_mass

ref_frame

frame index to select frame from reference

flat

represent Path.path as a 2D (\(N_p\times 3N\)) numpy.ndarray; if False then Path.path is a 3D (\(N_p\times N\times 3\)) numpy.ndarray [False]

save

boolean; if True, pickle list of names for fitted trajectories [True]

store

boolean; if True then writes each path (numpy.ndarray) in PSA.paths to compressed npz (numpy) files [False]

The fitted trajectories are written to new files in the “/trj_fit” subdirectory in PSA.targetdir named “filename(trajectory)XXX*infix*_psa”, where “XXX” is a number between 000 and 999; the extension of each file is the same as its original. Optionally, the trajectories can also be saved in numpy compressed npz format in the “/paths” subdirectory in PSA.targetdir for persistence and can be accessed as the attribute PSA.paths.

load()[source]

Load fitted paths specified by ‘psa_path-names.pkl’ in PSA.targetdir.

plot(filename=None, linkage='ward', count_sort=False, distance_sort=False, figsize=4.5, labelsize=12)[source]

Plot a clustered distance matrix using method linkage along with the corresponding dendrogram. Rows (and columns) are identified using the list of strings specified by PSA.labels.

Arguments:
filename

string, save figure to filename [None]

linkage

string, name of linkage criterion for clustering ['ward']

figsize

set the vertical size of plot in inches [6]

labelsize

set the font size for colorbar labels; font size for path labels on dendrogram default to 3 points smaller [12]

If filename is supplied then the figure is also written to file (the suffix determines the file type, e.g. pdf, png, eps, ...). All other keyword arguments are passed on to pylab.imshow().

run(**kwargs)[source]

Perform path similarity analysis on the trajectories to compute the distance matrix.

A number of parameters can be changed from the defaults. The result is stored as the array PSA.D.

Keywords:
metric

selection string specifying the path metric to measure pairwise distances among PSA.paths ['hausdorff']

start, stop, step

start and stop frame index with step size: analyze trajectory[start:stop:step] [None]

store

boolean; if True then writes PSA.D to text and compressed npz (numpy) files [True]

filename

string, filename to save PSA.D

save_paths(filename=None)[source]

Save fitted PSA.paths to numpy compressed npz files.

Arguments:
filename

string, specifies filename [None]

The data are saved with numpy.savez_compressed() in the directory specified by PSA.targetdir.

save_result(filename=None)[source]

Save distance matrix PSA.D to a numpy compressed npz file and text file.

Arguments:
filename

string, specifies filename [None]

The data are saved with numpy.savez_compressed() and numpy.savetxt() in the directory specified by PSA.targetdir.