Coverage for lingpy/algorithm/clustering.py : 98%

""" Module provides general clustering functions for LingPy. """
"No brackets, colons, and commas allowed for doculect names" )
""" Carry out a flat cluster analysis based on the UPGMA algorithm \ (:evobib:`Sokal1958`).
Parameters ----------
threshold : float The threshold which terminates the algorithm.
matrix : list A two-dimensional list containing the distances.
taxa : list (default=None) A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
Returns -------
clusters : dict A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.
Examples --------
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the flat cluster analysis.
>>> flat_upgma(0.6,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
See also --------
flat_cluster flat_upgma fuzzy link_clustering mcl
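The example above can be retraced without LingPy. The following is a minimal sketch, not LingPy's implementation: `squareform_sketch` and `flat_upgma_sketch` are hypothetical stand-ins, and stopping as soon as the closest pair of clusters reaches the threshold is an assumption.

```python
from itertools import combinations


def squareform_sketch(condensed):
    """Expand a condensed upper-triangle list into a symmetric square matrix."""
    # Solve n * (n - 1) / 2 == len(condensed) for n.
    n = int((1 + (1 + 8 * len(condensed)) ** 0.5) / 2)
    if n * (n - 1) // 2 != len(condensed):
        raise ValueError("not a valid condensed distance list")
    matrix = [[0.0] * n for _ in range(n)]
    values = iter(condensed)
    for i, j in combinations(range(n), 2):  # row-major upper triangle
        matrix[i][j] = matrix[j][i] = next(values)
    return matrix


def average_distance(a, b, matrix):
    """Mean pairwise distance between two clusters of row indices."""
    return sum(matrix[i][j] for i in a for j in b) / (len(a) * len(b))


def flat_upgma_sketch(threshold, matrix, taxa=None):
    """Merge the closest clusters until no pair is closer than threshold."""
    clusters = [[i] for i in range(len(matrix))]
    while len(clusters) > 1:
        dist, a, b = min(
            (average_distance(clusters[a], clusters[b], matrix), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if dist >= threshold:  # assumption: strict "<" is required to merge
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    names = taxa or list(range(len(matrix)))  # fall back to indices
    return {k: [names[i] for i in members] for k, members in enumerate(clusters)}


taxa = ['German', 'Swedish', 'Icelandic', 'English', 'Dutch']
matrix = squareform_sketch([0.5, 0.67, 0.8, 0.2, 0.4, 0.7, 0.6, 0.8, 0.8, 0.3])
print(flat_upgma_sketch(0.6, matrix, taxa))
```

On the data above, the sketch groups German, Dutch, and English together and keeps Swedish and Icelandic as a second cluster.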
""" return cluster.flat_upgma(threshold, matrix, taxa or [], revert)
""" Carry out a flat cluster analysis based on linkage algorithms.
Parameters ----------
method : { "upgma", "single", "complete", "ward"} Select between "upgma", "single", and "complete". You can also test "ward", but there is no guarantee that this is the correct algorithm.
threshold : float The threshold which terminates the algorithm.
matrix : list A two-dimensional list containing the distances.
taxa : list (default=None) A list containing the names of the taxa. If the list is left empty, the indices of the taxa will be returned instead of their names.
Returns -------
clusters : dict A dictionary with cluster-IDs as keys and a list of the taxa corresponding to the respective ID as values.
Examples --------
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the flat cluster analysis.
>>> flat_cluster('upgma',0.6,matrix,taxa)
{0: ['German', 'Dutch', 'English'], 1: ['Swedish', 'Icelandic']}
See also --------
flat_cluster flat_upgma fuzzy link_clustering mcl
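The linkage methods differ only in how the distance between two clusters is derived from the pairwise distances of their members. The sketch below illustrates the three standard rules; `linkage_distance` is a hypothetical helper and not part of LingPy, and "ward" is left out on purpose.

```python
def linkage_distance(a, b, matrix, method="upgma"):
    """Distance between clusters a and b (lists of row indices)."""
    pairwise = [matrix[i][j] for i in a for j in b]
    if method == "single":    # closest pair of members
        return min(pairwise)
    if method == "complete":  # farthest pair of members
        return max(pairwise)
    if method == "upgma":     # unweighted mean over all pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown method: {0}".format(method))


matrix = [
    [0.0, 0.5, 0.67, 0.8, 0.2],
    [0.5, 0.0, 0.4, 0.7, 0.6],
    [0.67, 0.4, 0.0, 0.8, 0.8],
    [0.8, 0.7, 0.8, 0.0, 0.3],
    [0.2, 0.6, 0.8, 0.3, 0.0],
]
german_dutch, swedish_icelandic = [0, 4], [1, 2]
for method in ("single", "complete", "upgma"):
    value = linkage_distance(german_dutch, swedish_icelandic, matrix, method)
    print(method, round(value, 4))
```

With a threshold of 0.6, the two clusters would be merged under single linkage (0.5) but kept apart under complete linkage (0.8) and UPGMA (0.6425).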
"""
""" Carry out a cluster analysis based on the UPGMA algorithm \ (:evobib:`Sokal1958`).
Parameters ----------
matrix : list A two-dimensional list containing the distances.
taxa : list A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool (default=True) If set to **False**, only the topology of the tree will be returned.
Returns -------
newick : str A string in newick-format which can be further used in biological software packages to view and plot the tree.
Examples --------
Function is automatically imported when importing lingpy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create an arbitrary list of taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> upgma(matrix,taxa,distances=False)
'((Swedish,Icelandic),(English,(German,Dutch)));'
See also --------
neighbor
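How the Newick topology arises can be sketched in a few lines. This is an illustrative re-implementation under simple assumptions (average linkage, no branch lengths, the first minimum wins on ties); `upgma_newick_sketch` is a hypothetical name, not LingPy's function.

```python
def upgma_newick_sketch(matrix, taxa):
    """Build a topology-only Newick string by repeated average-linkage merges."""
    # Each cluster is a pair: (member row indices, Newick label).
    clusters = [([i], name) for i, name in enumerate(taxa)]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                left, right = clusters[a][0], clusters[b][0]
                dist = sum(matrix[i][j] for i in left for j in right)
                dist /= len(left) * len(right)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        merged = (
            clusters[a][0] + clusters[b][0],
            "({0},{1})".format(clusters[a][1], clusters[b][1]),
        )
        # Drop the two merged clusters and append the new one at the end.
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0][1] + ";"


taxa = ['German', 'Swedish', 'Icelandic', 'English', 'Dutch']
matrix = [
    [0.0, 0.5, 0.67, 0.8, 0.2],
    [0.5, 0.0, 0.4, 0.7, 0.6],
    [0.67, 0.4, 0.0, 0.8, 0.8],
    [0.8, 0.7, 0.8, 0.0, 0.3],
    [0.2, 0.6, 0.8, 0.3, 0.0],
]
print(upgma_newick_sketch(matrix, taxa))
```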
"""
""" Function clusters data according to the Neighbor-Joining algorithm \ (:evobib:`Saitou1987`).
Parameters ----------
matrix : list A two-dimensional list containing the distances.
taxa : list A list containing the names of all taxa corresponding to the distances in the matrix.
distances : bool (default=True) If set to **False**, only the topology of the tree will be returned.
Returns -------
newick : str A string in newick-format which can be further used in biological software packages to view and plot the tree.
Examples --------
Function is automatically imported when importing lingpy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create an arbitrary list of taxa.
>>> taxa = ['Norwegian','Swedish','Icelandic','Dutch','English']
Create an arbitrary matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
Carry out the cluster analysis.
>>> neighbor(matrix,taxa)
'(((Norwegian,(Swedish,Icelandic)),English),Dutch);'
See also --------
upgma
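Neighbor-joining does not simply join the closest pair. It picks the pair that minimizes the criterion Q(i, j) = (n - 2) * d(i, j) - sum_k d(i, k) - sum_k d(j, k). The sketch below computes only this first selection step; `first_neighbor_join` is a hypothetical helper and not LingPy's implementation.

```python
def first_neighbor_join(matrix):
    """Return (Q, i, j) for the pair that neighbor-joining would join first."""
    n = len(matrix)
    totals = [sum(row) for row in matrix]  # summed distance of each taxon
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * matrix[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best


taxa = ['Norwegian', 'Swedish', 'Icelandic', 'Dutch', 'English']
matrix = [
    [0.0, 0.5, 0.67, 0.8, 0.2],
    [0.5, 0.0, 0.4, 0.7, 0.6],
    [0.67, 0.4, 0.0, 0.8, 0.8],
    [0.8, 0.7, 0.8, 0.0, 0.3],
    [0.2, 0.6, 0.8, 0.3, 0.0],
]
q, i, j = first_neighbor_join(matrix)
print(taxa[i], taxa[j], round(q, 2))
```

On this matrix the first pair is Swedish and Icelandic, although the smallest raw distance (0.2) lies between the first and the last taxon.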
"""
""" Create fuzzy cluster of a given distance matrix.
Parameters ----------
threshold : float The threshold that shall be used for the basic clustering of the data.
matrix : list A two-dimensional list containing the distances.
taxa : list A list containing the names of all taxa corresponding to the distances in the matrix.
method : { "upgma", "single", "complete" } (default="upgma") Select the method for the flat cluster analysis.
distances : bool If set to **False**, only the topology of the tree will be returned.
revert : bool (default=False) Specify whether a reverted dictionary should be returned.
Returns -------
cluster : dict A dictionary with cluster-IDs as keys and a list as value, containing the taxa that are assigned to a given cluster-ID.
Examples --------
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the fuzzy flat cluster analysis.
>>> fuzzy(0.5,matrix,taxa)
{1: ['Swedish', 'Icelandic'], 2: ['Dutch', 'German'], 3: ['Dutch', 'English']}
Notes -----
This is a very simple fuzzy clustering algorithm. It does nothing more than successively remove taxa from the matrix, flat-cluster the remaining taxa with the corresponding threshold, and return a combined "consensus" cluster in which taxa may be assigned to multiple clusters.
See also --------
link_clustering
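The leave-one-out idea described in the notes can be sketched as follows. This is a guess at the general scheme, not LingPy's code: `flat_upgma_indices` and `fuzzy_sketch` are hypothetical, the flat clustering uses average linkage only, and building the "consensus" by collecting every multi-member cluster seen in any run is an assumption.

```python
from itertools import combinations


def flat_upgma_indices(threshold, matrix, indices):
    """Flat average-linkage clustering restricted to the given row indices."""
    clusters = [[i] for i in indices]
    while len(clusters) > 1:
        dist, a, b = min(
            (
                sum(matrix[i][j] for i in clusters[a] for j in clusters[b])
                / (len(clusters[a]) * len(clusters[b])),
                a,
                b,
            )
            for a, b in combinations(range(len(clusters)), 2)
        )
        if dist >= threshold:  # assumption: merge only below the threshold
            break
        clusters[a] += clusters[b]
        del clusters[b]
    return clusters


def fuzzy_sketch(threshold, matrix, taxa):
    """Collect the clusters found while leaving out one taxon at a time."""
    indices = list(range(len(taxa)))
    found = set()
    for left_out in indices:
        rest = [i for i in indices if i != left_out]
        for cluster in flat_upgma_indices(threshold, matrix, rest):
            if len(cluster) > 1:
                found.add(frozenset(cluster))
    # Taxa that never joined a larger cluster are kept as singletons.
    covered = set().union(*found)
    found.update(frozenset([i]) for i in indices if i not in covered)
    ordered = sorted(found, key=sorted)
    return {k: [taxa[i] for i in sorted(c)] for k, c in enumerate(ordered, start=1)}


taxa = ['German', 'Swedish', 'Icelandic', 'English', 'Dutch']
matrix = [
    [0.0, 0.5, 0.67, 0.8, 0.2],
    [0.5, 0.0, 0.4, 0.7, 0.6],
    [0.67, 0.4, 0.0, 0.8, 0.8],
    [0.8, 0.7, 0.8, 0.0, 0.3],
    [0.2, 0.6, 0.8, 0.3, 0.0],
]
print(fuzzy_sketch(0.5, matrix, taxa))
```

On the example data the sketch finds the same three groups as the documented call, with Dutch assigned to two of them. Cluster numbering and member order may differ from LingPy's output.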
"""
method, threshold, new_matrix, [t for t in taxa if t != taxon])
else:
""" Calculate a tree of a given distance matrix.
Parameters ----------
matrix : list The distance matrix to be used.
taxa : list A list of the taxa in the distance matrix.
tree_calc : str (default="neighbor") The method for tree calculation that shall be used. Select between:
* "neighbor": Neighbor-joining method (:evobib:`Saitou1987`)
* "upgma" : UPGMA method (:evobib:`Sokal1958`)
distances : bool (default=True) If set to **True**, distances will be included in the tree-representation.
filename : str (default='') If a filename is specified, the data will be written to that file.
Returns -------
tree : ~lingpy.thirdparty.cogent.tree.PhyloNode A ~lingpy.thirdparty.cogent.tree.PhyloNode object for handling tree files. """
else:
""" Calculate flat cluster of distance matrix.
Parameters ----------
threshold : float The threshold to be used for the calculation.
matrix : list The distance matrix to be used.
taxa : list A list of the taxa in the distance matrix.
cluster_method : {"upgma", "mcl", "single", "complete"} (default="upgma")
Returns -------
groups : dict A dictionary with the taxa as keys and the group assignment as values.
Notes -----
This function is important for internal calculations within wordlist. It is not recommended for further use. """
cluster_method, threshold, matrix, taxa=[t for t in taxa]) else: # cluster_method in ['mcl', 'markov']:
else:
""" Get weighted average degree. """
""" Use a variant of the method by :evobib:`Apeltsin2011` in order to find an optimal threshold.
Parameters ----------
matrix : list The distance matrix for which the threshold shall be determined.
thresholds : list (default=[i*0.05 for i in range(1,19)[::-1]]) The range of thresholds that shall be tested.
logs : {bool,builtins.function} (default=True) If set to **True**, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to **False**, all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.
Returns -------
threshold : {float,None} If a float is returned, this is the threshold identified by the method. If **None** is returned, no threshold could be identified.
Notes -----
This is a very simple method that may not work well depending on the dataset. We therefore recommend using it with great care. """
# get the old degree of the matrix
# store the plateaus (where nothing changes in the network)
# this is the current index of the last plateau
# start iterating and calculating # get the new degree of the matrix under threshold t
# if there is a new degree # get the change in comparison with the old degree
# swap old degree to new degree
# if there's a plateau, the changed degree should be equal to or # greater than zero else:
# try to find the plateau of maximal length # check if first entry is NOT of length 1 for t in sorted_plato if len(plato[t]) > 1][0]
threshold, matrix, taxa, link_threshold=False, revert=False, matrix_type="distances", fuzzy=True):
""" Carry out a link clustering analysis using the method by :evobib:`Ahn2010`.
Parameters ----------
threshold : {float, bool} The threshold that shall be used for the initial selection of links assigned to the data. If set to **False**, the weights from the matrix will be used directly.
matrix : list A two-dimensional list containing the distances.
taxa : list A list containing the names of all taxa corresponding to the distances in the matrix.
link_threshold : float (default=0.5) The threshold that shall be used for the internal clustering of the data.
matrix_type : {"distances","similarities","weights"} (default="distances") Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains "similarities", no adaptation is needed. If it contains "weights", a weighted version of link clustering (see the supplementary in :evobib:`Ahn2010` for details) will be carried out.
Returns -------
cluster : dict A dictionary with cluster-IDs as keys and a list as value, containing the taxa that are assigned to a given cluster-ID.
Examples --------
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the link-clustering analysis.
>>> link_clustering(0.5,matrix,taxa)
{1: ['Dutch', 'English', 'German'], 2: ['Icelandic', 'Swedish']}
See also --------
fuzzy
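Link clustering groups edges rather than nodes. In the unweighted variant, two edges that share a node are compared by the Jaccard overlap of the inclusive neighborhoods of their two unshared end points. The sketch below shows only this similarity step for an unweighted graph; the helper names are hypothetical and the edge list assumes that all pairs with a distance below 0.5 are linked.

```python
def inclusive_neighbors(edges):
    """Map each node to the set of its neighbors, including the node itself."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, {a}).add(b)
        neighbors.setdefault(b, {b}).add(a)
    return neighbors


def link_similarity(edge_a, edge_b, neighbors):
    """Jaccard similarity of two edges that share exactly one node, else 0."""
    shared = set(edge_a) & set(edge_b)
    if len(shared) != 1:
        return 0.0
    (i,) = set(edge_a) - shared  # unshared end point of the first edge
    (j,) = set(edge_b) - shared  # unshared end point of the second edge
    return len(neighbors[i] & neighbors[j]) / len(neighbors[i] | neighbors[j])


# Pairs of the example matrix whose distance is below 0.5 (an assumption).
edges = [('German', 'Dutch'), ('English', 'Dutch'), ('Swedish', 'Icelandic')]
neighbors = inclusive_neighbors(edges)
print(link_similarity(edges[0], edges[1], neighbors))
print(link_similarity(edges[0], edges[2], neighbors))
```

The two edges meeting in Dutch are similar to some degree, while edges without a common node are not compared at all, which is why German, Dutch, and English can end up in one community.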
""" # check for matrix type else:
# get the edges and the adjacency from the thresholds
# initialize the HLC object else: # check for null edges: if they occur, return the clusters directly return {a: [b] for a, b in zip(taxa, range(len(taxa)))} else: else: return {a: [b] for a, b in zip(range(len(taxa)), taxa)} else:
# carry out the analyses using defaults for the clustering
# retrieve all clusterings for the nodes # retrieve the data
# count the links of
# delete all clusters that appear as subsets of larger clusters
# renumber the data
# determine weights for communities to edges
# revert stuff first
# weight membership of nodes and assign to most prominent community clr, key=lambda x: node_weights[t][x] if x in node_weights[t] else 0, reverse=True)
# the following lines of code are devoted to mcl clustering algorithm
""" Normalize the matrix. """
""" Check whether the matrix is idempotent. """
""" Look for attracting nodes in the matrix. """
# make a converter for length
threshold, matrix, taxa, max_steps=1000, inflation=2, expansion=2, add_self_loops=True, revert=False, logs=True, matrix_type="distances"):
""" Carry out a clustering using the MCL algorithm (:evobib:`Dongen2000`).
Parameters ----------
threshold : {float, bool} The threshold that shall be used for the initial selection of links assigned to the data. If set to **False**, the weights from the matrix will be used directly.
matrix : list A two-dimensional list containing the distances.
taxa : list A list containing the names of all taxa corresponding to the distances in the matrix.
max_steps : int (default=1000) Maximal number of iterations.
inflation : int (default=2) Inflation parameter for the MCL algorithm.
expansion : int (default=2) Expansion parameter of the MCL algorithm.
add_self_loops : {True, False, builtins.function} (default=True) Determine whether self-loops should be added, and if so, how they should be weighted. If a function for the calculation of self-loops is given, it will take the whole column of the matrix for each taxon as input.
logs : { bool, function } (default=True) If set to **True**, the logarithm of the score beyond the threshold will be assigned as weight to the graph. If set to **False**, all weights will be set to 1. Use a custom function to define individual ways to calculate the weights.
matrix_type : { "distances", "similarities" } Specify the type of the matrix. If the matrix contains distance data, it will be adapted to similarity data. If it contains "similarities", no adaptation is needed.
Examples --------
The function is automatically imported along with LingPy.
>>> from lingpy import *
>>> from lingpy.algorithm import squareform
Create a list of arbitrary taxa.
>>> taxa = ['German','Swedish','Icelandic','English','Dutch']
Create an arbitrary distance matrix.
>>> matrix = squareform([0.5,0.67,0.8,0.2,0.4,0.7,0.6,0.8,0.8,0.3])
>>> matrix
[[0.0, 0.5, 0.67, 0.8, 0.2],
 [0.5, 0.0, 0.4, 0.7, 0.6],
 [0.67, 0.4, 0.0, 0.8, 0.8],
 [0.8, 0.7, 0.8, 0.0, 0.3],
 [0.2, 0.6, 0.8, 0.3, 0.0]]
Carry out the MCL analysis.
>>> mcl(0.5,matrix,taxa)
{1: ['German', 'English', 'Dutch'], 2: ['Swedish', 'Icelandic']}
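The loop of expansion, inflation, and normalization can be sketched in plain Python. This is an illustration under simplifying assumptions, not LingPy's implementation: edges carry unit weights, self-loops have weight 1, columns are normalized, and the convergence tolerance and the cut-off used to read off the clusters are arbitrary choices. All names are hypothetical.

```python
def normalize(matrix):
    """Scale every column so that it sums to 1."""
    n = len(matrix)
    sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a, b):
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl_sketch(adjacency, inflation=2, expansion=2, max_steps=1000, tol=1e-9):
    """Return the clusters (sets of node indices) found by a basic MCL loop."""
    n = len(adjacency)
    # Add self-loops with weight 1, then normalize the columns.
    current = normalize([[1.0 if i == j else float(adjacency[i][j])
                          for j in range(n)] for i in range(n)])
    for _step in range(max_steps):
        expanded = current
        for _power in range(expansion - 1):  # expansion: matrix power
            expanded = multiply(expanded, current)
        # Inflation: element-wise power followed by normalization.
        inflated = normalize([[x ** inflation for x in row] for row in expanded])
        change = max(abs(inflated[i][j] - current[i][j])
                     for i in range(n) for j in range(n))
        current = inflated
        if change < tol:  # the matrix no longer changes
            break
    clusters = set()
    for i in range(n):
        if current[i][i] > 1e-6:  # row i belongs to an attractor
            clusters.add(frozenset(j for j in range(n) if current[i][j] > 1e-6))
    return clusters


matrix = [
    [0.0, 0.5, 0.67, 0.8, 0.2],
    [0.5, 0.0, 0.4, 0.7, 0.6],
    [0.67, 0.4, 0.0, 0.8, 0.8],
    [0.8, 0.7, 0.8, 0.0, 0.3],
    [0.2, 0.6, 0.8, 0.3, 0.0],
]
# Assumption: link all pairs whose distance is below the threshold of 0.5.
adjacency = [[1 if i != j and matrix[i][j] < 0.5 else 0 for j in range(5)]
             for i in range(5)]
print(mcl_sketch(adjacency))
```

On the example data the sketch separates the indices of German, English, and Dutch from those of Swedish and Icelandic.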
""" # check for type of matrix else:
# check for matrix type and decide how to handle logs logs = lambda x: x else: else:
# check for threshold
# check for self_loops else:
# normalize the matrix
# start looping and the like # expansion
# inflation
# normalization
# increase steps
# check for matrix convergence
# retrieve the clusters
# modify clusters
""" Calculate partition density for a given threshold on a distance matrix.
Notes ----- See :evobib:`Ahn2012` for details on the calculation of partition density in a given network. """
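For orientation, partition density as it is usually defined for link communities sums, over all communities c with m_c links and n_c nodes, the term m_c * (m_c - (n_c - 1)) / ((n_c - 2) * (n_c - 1)) and scales the sum by 2 / M, where M is the total number of links. Communities with only two nodes contribute nothing. The sketch below implements this formula for communities given as edge lists; it assumes that this is the definition the cited reference uses, and the function name and input format are hypothetical.

```python
def partition_density_sketch(communities):
    """Partition density for communities given as lists of (node, node) edges."""
    total_links = sum(len(edges) for edges in communities)
    if total_links == 0:
        return 0.0
    score = 0.0
    for edges in communities:
        m = len(edges)                                   # links in the community
        n = len({node for edge in edges for node in edge})  # nodes it touches
        if n > 2:  # a single link (two nodes) has density 0 by definition
            score += m * (m - (n - 1)) / ((n - 2) * (n - 1))
    return 2 * score / total_links


triangle = [('a', 'b'), ('b', 'c'), ('a', 'c')]
path = [('a', 'b'), ('b', 'c')]
single = [('d', 'e')]
print(partition_density_sketch([triangle]))          # fully connected community
print(partition_density_sketch([path]))              # tree-like community
print(partition_density_sketch([triangle, single]))  # mixed partition
```

A clique reaches the maximum of 1, a tree yields 0, and mixed partitions fall in between, weighted by the number of links in each community.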
# compute cutoff for matrix at t
# get the total number of links
# get connected components
else: # most complicated, update all the stuff
# determine best idx this = parts[j] other = parts[i] else: other = parts[j]
# find all neighbors of the
# finish unconnected components
# convert to dictionary
# return zero, if all components are different
# count density
# get nodes
# get edges
# calculate sum formula
""" Calculate the best threshold by maximizing partition density for a given range of thresholds.
Notes ----- This method makes use of the idea of partition density proposed in :evobib:`Ahn2010`.
"""
# strip off the highest values from the end else:
else:
else: