3.1.5. Calculating path similarity — MDAnalysis.analysis.psa
¶
Author: | Sean Seyler |
---|---|
Year: | 2015 |
Copyright: | GNU Public License v3 |
New in version 0.10.0.
The module contains code to calculate the geometric similarity of trajectories using path metrics such as the Hausdorff or Fréchet distances [Seyler2015]. The path metrics are functions of two paths and return a nonnegative number, i.e., a distance. Two paths are identical if their distance is zero, and large distances indicate dissimilarity. Each path metric is a function of the individual points (e.g., coordinate snapshots) that comprise each path and, loosely speaking, identify the two points, one per path of a pair of paths, where the paths deviate the most. The distance between these points of maximal deviation is measured by the root mean square deviation (RMSD), i.e., to compute structural similarity.
One typically computes the pairwise similarity for an ensemble of paths to produce a symmetric distance matrix, which can be clustered to, at a glance, identify patterns in the trajectory data. To properly analyze a path ensemble, one must select a suitable reference structure to which all paths (each conformer in each path) will be universally aligned using the rotations determined by the best-fit rmsds. Distances between paths and their structures are then computed directly with no further alignment. This pre-processing step is necessary to preserve the metric properties of the Hausdorff and Fréchet metrics; using the best-fit rmsd on a pairwise basis does not generally preserve the triangle inequality.
See also
The PSAnalysisTutorial outlines a typical application of PSA to a set of trajectories, including doing proper alignment, performing distance comparisons, and generating heat map-dendrogram plots from hierarchical clustering.
References
[Seyler2015] | Sean L. Seyler, Avishek Kumar, Michael F. Thorpe, Oliver Beckstein. Path Similarity Analysis: a Method for Quantifying Macromolecular Pathways. arXiv:1505.04807 (2015). |
3.1.5.1. Helper functions and variables¶
The following global variables are used by the functions and classes in this module.
-
MDAnalysis.analysis.psa.
hausdorff_names
¶
-
MDAnalysis.analysis.psa.
frechet_names
¶
The following functions are used by the other functions in this module.
-
MDAnalysis.analysis.psa.
sqnorm
(v, axis=None)[source]¶ Compute the sum of squares of elements along specified axes.
Arguments: - v
numpy.ndarray
of coordinates- axis
None or int or tuple of ints, optional Axis or axes along which a sum is performed. The default (axis = None) is perform a sum over all the dimensions of the input array. axis may be negative, in which case it counts from the last to the first axis.
Returns: float, the sum of the squares of the elements of v along axis
3.1.5.2. Classes, methods, and functions¶
-
MDAnalysis.analysis.psa.
get_path_metric_func
(name)[source]¶ Selects a path metric function by name.
Arguments: - name
string, name of path metric
Returns: The path metric function specified by name (if found).
-
MDAnalysis.analysis.psa.
hausdorff
(P, Q, N=None)[source]¶ Calculate the Hausdorff distance between two paths.
P (Q) is a
numpy.ndarray
of \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g.,MDAnalysis.core.AtomGroup.AtomGroup.coordinates()
). P (Q) has either shape \(N_p\times N\times 3\) (\(N_q\times N\times 3\)), or \(N_p\times(3N)\) (\(N_q\times(3N)\)) in flattened form.Arguments: - P
numpy.ndarray
representing a path- Q
numpy.ndarray
representing a path- N
int, number of atoms; if
None
then P and Q are each assumed to be a 3Dnumpy.ndarray
; else, they are 2D (flattened)
Returns: float, the Hausdorff distance between paths P and Q
- Example::
>>> u = Universe(PSF,DCD) >>> mid = len(u.trajectory)/2 >>> ca = u.selectAtoms('name CA') >>> P = numpy.array([ ... ca.positions for _ in u.trajectory[:mid:] ... ]) # first half of trajectory >>> Q = numpy.array([ ... ca.positions for _ in u.trajectory[mid::] ... ]) # second half of trajectory >>> hausdorff(P,Q) 4.7786639840135905 >>> hausdorff(P,Q[::-1]) # hausdorff distance w/ reversed 2nd trajectory 4.7786639840135905
-
MDAnalysis.analysis.psa.
discrete_frechet
(P, Q, N=None)[source]¶ Calculate the discrete Frechet distance between two paths.
P (Q) is a
numpy.ndarray
of \(N_p\) (\(N_q\)) time steps, \(N\) atoms, and \(3N\) coordinates (e.g.,MDAnalysis.core.AtomGroup.AtomGroup.coordinates()
). P (Q) has either shape \(N_p\times N\times 3\) (\(N_q\times N\times 3\)), or :\(N_p\times(3N)\) (\(N_q\times(3N)\)) in flattened form.Arguments: - P
numpy.ndarray
representing a path- Q
numpy.ndarray
representing a path- N
int, number of atoms; if
None
then P and Q are each assumed to be a 3Dnumpy.ndarray
; else, they are 2D (flattened)
Returns: float, the discrete Frechet distance between paths P and Q
- Example::
>>> u = Universe(PSF,DCD) >>> mid = len(u.trajectory)/2 >>> ca = u.selectAtoms('name CA') >>> P = numpy.array([ ... ca.positions for _ in u.trajectory[:mid:] ... ]) # first half of trajectory >>> Q = numpy.array([ ... ca.positions for _ in u.trajectory[mid::] ... ]) # second half of trajectory >>> discrete_frechet(P,Q) 4.7786639840135905 >>> discrete_frechet(P,Q[::-1]) # frechet distance w/ 2nd trj reversed 2nd 6.8429011177113832
-
class
MDAnalysis.analysis.psa.
Path
(universe, reference, ref_select='name CA', path_select='all', ref_frame=0)[source]¶ Pre-process a
MDAnalysis.Universe
object: (1) fit the trajectory to a reference structure, (2) convert fitted time series to anumpy.ndarray
representation ofPath.path
.The analysis is performed with
PSA.run()
and stores the result in thenumpy.ndarray
distance matrixPSA.D
.PSA.run()
also generates a fitted trajectory and path from alignment of the original trajectories to a reference structure.New in version 0.9.1.
Setting up trajectory alignment and fitted path generation.
Arguments: - universe
MDAnalysis.Universe
object containing a trajectory- reference
reference structure;
MDAnalysis.Universe
object; ifNone
then traj is used (uses the current time step of the object) [None
]- ref_select
The selection to operate on for rms fitting; can be one of:
- any valid selection string for
selectAtoms()
that produces identical selections in mobile and reference; or - a dictionary
{'mobile':sel1, 'reference':sel2}
(theMDAnalysis.analysis.align.fasta2select()
function returns such a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or - a tuple
(sel1, sel2)
When using 2. or 3. with sel1 and sel2 then these selections can also each be a list of selection strings (to generate an AtomGroup with defined atom order as described under Ordered selections).
- any valid selection string for
- path_select
atom selection composing coordinates of (fitted) path; if
None
then path_select is set to ref_select [None
]
-
u_original
¶ MDAnalysis.Universe
object with a trajectory
-
u_reference
¶ MDAnalysis.Universe
object containing a reference structure
-
ref_select
¶ string, selection for
selectAtoms()
to select frame fromPath.u_reference
-
path_select
¶ string, selection for
selectAtoms()
to select atoms to composePath.path
-
ref_frame
¶ int, frame index to select frame from
Path.u_reference
-
u_fitted
¶ MDAnalysis.Universe
object with the fitted trajectory
-
path
¶ numpy.ndarray
object representation of the fitted trajectory
-
fit_to_reference
(filename=None, prefix='', postfix='_fit', rmsdfile=None, targetdir='.', mass_weighted=False, tol_mass=0.1)[source]¶ Align each trajectory frame to the reference structure with
MDAnalysis.analysis.align.rms_fit_trj()
.Arguments: - filename
file name for the RMS-fitted trajectory or pdb; defaults to the original trajectory filename (from
Path.u_original
) with prefix prepended- prefix
prefix for auto-generating the new output filename
- rmsdfile
file name for writing the RMSD time series [
None
]- mass_weighted
do a mass-weighted RMSD fit
- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]
Returns: MDAnalysis.Universe
object containing a fitted trajectory
-
run
(align=False, filename=None, postfix='_fit', rmsdfile=None, targetdir='.', mass_weighted=False, tol_mass=0.1, flat=False)[source]¶ Generate a path from a trajectory and reference structure, aligning to a reference structure if specified.
This is a convenience method to generate a fitted trajectory from an inputted universe (
Path.u_original
) and reference structure (Path.u_reference
).Path.fit_to_reference()
andPath.to_path()
are used consecutively to generate a new universe (Path.u_fitted
) containing the fitted trajectory along with the correspondingPath.path
represented as annumpy.ndarray
. The method returns a tuple of the topology name and new trajectory name, which can be fed directly into anMDAnalysis.Universe
object after unpacking the tuple using the*
operator, as inMDAnalysis.Universe(*(top_name, newtraj_name))
.Arguments: - align
Align trajectory to atom selection
Path.ref_select
ofPath.u_reference
. IfTrue
, a universe containing an aligned trajectory is produced withPath.fit_to_reference()
[False
]- filename
filename for the RMS-fitted trajectory or pdb; defaults to the original trajectory filename (from
Path.u_original
) with prefix prepended- prefix
prefix for auto-generating the new output filename
- rmsdfile
file name for writing the RMSD time series [
None
]- mass_weighted
do a mass-weighted RMSD fit
- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]
- flat
represent
Path.path
with a 2D (\(N_p\times 3N\))numpy.ndarray
; ifFalse
thenPath.path
is a 3D (\(N_p\times N\times 3\))numpy.ndarray
[False
]
Returns: A tuple of the topology name and new trajectory name.
-
to_path
(fitted=False, select=None, flat=False)[source]¶ Generates a coordinate time series from the fitted universe trajectory.
Given a selection of N atoms from select, the atomic positions for each frame in the fitted universe (
Path.u_fitted
) trajectory (with \(N_p\) total frames) are appended sequentially to form a 3D or 2D (if flat isTrue
)numpy.ndarray
representation of the fitted trajectory (with dimensions \(N_p\times N\times 3\) or \(N_p\times 3N\), respectively).Arguments: - fitted
construct a
Path.path
from thePath.u_fitted
trajectory; ifFalse
thenPath.path
is generated with the trajectory fromPath.u_original
[False
]- select
the selection for constructing the coordinates of each frame in
Path.path
; ifNone
thenPath.path_select
is used, else it is overridden by select [None
]- flat
represent
Path.path
as a 2D (\(N_p\times 3N\))numpy.ndarray
; ifFalse
thenPath.path
is a 3D (\(N_p\times N\times 3\))numpy.ndarray
[False
]
Returns: numpy.ndarray
representing a time series of atomic positions of anMDAnalysis.core.AtomGroup.AtomGroup
selection fromPath.u_fitted.trajectory
-
class
MDAnalysis.analysis.psa.
PSA
(universes, reference=None, ref_select='name CA', ref_frame=0, path_select=None, labels=None, targetdir=None)[source]¶ Perform Path Similarity Analysis (PSA) on a set of trajectories.
The analysis is performed with
PSA.run()
and stores the result in thenumpy.ndarray
distance matrixPSA.D
.PSA.run()
also generates a fitted trajectory and path from alignment of the original trajectories to a reference structure.New in version 0.8.
Setting up Path Similarity Analysis.
The mutual similarity between all unique pairs of trajectories are computed using a selected path metric.
Arguments: - universes
a list of universes (
MDAnalysis.Universe
object), each containing a trajectory- reference
reference coordinates;
MDAnalysis.Universe
object; ifNone
the first time step of the first item in trajs is used [None
]- ref_select
The selection to operate on; can be one of:
- any valid selection string for
selectAtoms()
that produces identical selections in mobile and reference; or - a dictionary
{'mobile':sel1, 'reference':sel2}
(theMDAnalysis.analysis.align.fasta2select()
function returns such a dictionary based on a ClustalW_ or STAMP_ sequence alignment); or - a tuple
(sel1, sel2)
When using 2. or 3. with sel1 and sel2 then these selections can also each be a list of selection strings (to generate an AtomGroup with defined atom order as described under Ordered selections).
- any valid selection string for
- mass_weighted
do a mass-weighted RMSD fit [
False
]- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass [0.1]
- ref_frame
frame index to select frame from reference [0]
- path_select
atom selection composing coordinates of (fitted) path; if
None
then path_select is set to ref_select [None
]- targetdir
output files are saved there [.]
- labels
list of strings, names of trajectories to be analyzed (
MDAnalysis.Universe
); ifNone
, defaults to trajectory names [None
]
-
universes
¶ list of
MDAnalysis.Universe
objects containing trajectories
-
u_reference
¶ MDAnalysis.Universe
object containing a reference structure
-
ref_select
¶ string, selection for
selectAtoms()
to select frame fromPSA.u_reference
-
path_select
¶ string, selection for
selectAtoms()
to select atoms to composePath.path
-
ref_frame
¶ int, frame index to select frame from
Path.u_reference
-
paths
¶ list of
numpy.ndarray
objects representing the set/ensemble of fitted trajectories
-
cluster
(distArray, method='ward', count_sort=False, distance_sort=False, no_labels=True, color_threshold=4)[source]¶ Cluster trajectories.
Arguments: - method
string, name of linkage criterion for clustering [
'ward'
]- no_labels
boolean, if
True
then do not label dendrogram [True
]- color_threshold
For brevity, let t be the color_threshold. Colors all the descendent links below a cluster node k the same color if k is the first node below the cut threshold t. All links connecting nodes with distances greater than or equal to the threshold are colored blue. If t is less than or equal to zero, all nodes are colored blue. If color_threshold is None or ‘default’, corresponding with MATLAB(TM) behavior, the threshold is set to 0.7*max(Z[:,2]). [
4
]]
Returns: list of indices representing the row-wise order of the objects after clustering
-
generate_paths
(**kwargs)[source]¶ Generate paths, aligning each to reference structure if necessary.
Keywords: - align
Align trajectories to atom selection
PSA.ref_select
ofPSA.u_reference
[False
]- filename
strings representing base filename for fitted trajectories and paths [
None
]- infix
additional tag string that is inserted into the output filename of the fitted trajectory files [‘’]
- mass_weighted
do a mass-weighted RMSD fit
- tol_mass
Reject match if the atomic masses for matched atoms differ by more than tol_mass
- ref_frame
frame index to select frame from reference
- flat
represent
Path.path
as a 2D (\(N_p\times 3N\))numpy.ndarray
; ifFalse
thenPath.path
is a 3D (\(N_p\times N\times 3\))numpy.ndarray
[False
]- save
boolean; if
True
, pickle list of names for fitted trajectories [True
]- store
boolean; if
True
then writes each path (numpy.ndarray
) inPSA.paths
to compressed npz (numpy) files [False
]
The fitted trajectories are written to new files in the “/trj_fit” subdirectory in
PSA.targetdir
named “filename(trajectory)XXX*infix*_psa”, where “XXX” is a number between 000 and 999; the extension of each file is the same as its original. Optionally, the trajectories can also be saved in numpy compressed npz format in the “/paths” subdirectory inPSA.targetdir
for persistence and can be accessed as the attributePSA.paths
.
-
plot
(filename=None, linkage='ward', count_sort=False, distance_sort=False, figsize=4.5, labelsize=12)[source]¶ Plot a clustered distance matrix using method linkage along with the corresponding dendrogram. Rows (and columns) are identified using the list of strings specified by
PSA.labels
.Arguments: - filename
string, save figure to filename [
None
]- linkage
string, name of linkage criterion for clustering [
'ward'
]- figsize
set the vertical size of plot in inches [
6
]- labelsize
set the font size for colorbar labels; font size for path labels on dendrogram default to 3 points smaller [
12
]
If filename is supplied then the figure is also written to file (the suffix determines the file type, e.g. pdf, png, eps, ...). All other keyword arguments are passed on to
pylab.imshow()
.
-
run
(**kwargs)[source]¶ Perform path similarity analysis on the trajectories to compute the distance matrix.
A number of parameters can be changed from the defaults. The result is stored as the array
PSA.D
.Keywords: - metric
selection string specifying the path metric to measure pairwise distances among
PSA.paths
['hausdorff'
]- start, stop, step
start and stop frame index with step size: analyze
trajectory[start:stop:step]
[None
]- store
boolean; if
True
then writesPSA.D
to text and compressed npz (numpy) files [True
]- filename
string, filename to save
PSA.D
-
save_paths
(filename=None)[source]¶ Save fitted
PSA.paths
to numpy compressed npz files.Arguments: - filename
string, specifies filename [
None
]
The data are saved with
numpy.savez_compressed()
in the directory specified byPSA.targetdir
.
-
save_result
(filename=None)[source]¶ Save distance matrix
PSA.D
to a numpy compressed npz file and text file.Arguments: - filename
string, specifies filename [
None
]
The data are saved with
numpy.savez_compressed()
andnumpy.savetxt()
in the directory specified byPSA.targetdir
.