Coverage for lingpy/compare/lexstat.py : 99%

# -*- coding: utf-8 -*-
from lingpy.sequence.sound_classes import (
    ipa2tokens, tokens2class, prosodic_string, prosodic_weights,
    class2tokens, check_tokens)
"""Generator for error reports on token strings.
:param key_and_tokens: iterator over (key, token_string) pairs. """ else: stress=stress): line[idx]), line)
# a full charstring # a reduced charstring
char_from_charstring(charA), char_from_charstring(charB))
""" Basic class for automatic cognate detection.
Parameters
----------
filename : str
    The name of the file that shall be loaded.
model : :py:class:`~lingpy.data.model.Model`
    The sound-class model that shall be used for the analysis.
    Defaults to the SCA sound-class model.
merge_vowels : bool (default=True)
    Indicate whether consecutive vowels should be merged into single
    tokens or kept apart as separate tokens.
transform : dict
    A dictionary that indicates how prosodic strings should be
    simplified (or generally transformed), using a simple key-value
    structure with the key referring to the original prosodic context
    and the value to the new value. Currently, prosodic strings (see
    :py:meth:`~lingpy.sequence.sound_classes.prosodic_string`) offer
    11 different prosodic contexts. Since not all of them are helpful
    in preliminary analyses for cognate detection, it is useful to
    merge some of these contexts into one. The default settings
    distinguish only 5 instead of the 11 available contexts, namely:
    * ``C`` for all consonants in prosodically ascending position,
    * ``c`` for all consonants in prosodically descending position,
    * ``V`` for all vowels,
    * ``T`` for all tones, and
    * ``_`` for word-breaks.
    Make sure to also check the "vowels" keyword when initialising a
    LexStat object, since the symbols you use for vowels and tones
    should be identical with the ones you define in your transform
    dictionary.
vowels : str (default="VT\_")
    For scoring function creation using the
    :py:class:`~lingpy.compare.lexstat.LexStat.get_scorer` function,
    you have the possibility to use reduced scores for the matching of
    tones and vowels by modifying the "vscale" parameter, which is set
    to 0.5 as a default. In order to make sure that vowels and tones
    are properly detected, make sure your prosodic string
    representation of vowels matches the one in this keyword. Thus, if
    you change the prosodic strings using the "transform" keyword, you
    also need to change the vowel string, to make sure that "vscale"
    works as intended in the
    :py:class:`~lingpy.compare.lexstat.LexStat.get_scorer` function.
check : bool (default=False)
    If set to **True**, the input file will first be checked for
    errors before the calculation is carried out. Errors will be
    written to the error log (keyword "errors", defaulting to
    ``errors.log``). See also ``apply_checks``.
apply_checks : bool (default=False)
    If set to **True**, any errors identified by `check` will be
    handled silently.
no_bscorer : bool (default=False)
    If set to **True**, this will suppress the creation of a
    language-specific scoring function (which may become quite large
    and is additional ballast if the method "lexstat" is not used
    after all). If you use the "lexstat" method, however, this needs
    to be set to **False**.
errors : str
    The name of the error log.
segments : str (default="tokens")
    The name of the column in your data which contains the segmented
    transcriptions, or in which the segmented transcriptions should be
    placed.
transcription : str (default="ipa")
    The name of the column in your data which contains the unsegmented
    transcriptions.
classes : str (default="classes")
    The name of the column in the data which contains the sound class
    representation of the transcriptions, or in which this information
    shall be placed after automatic conversion.
numbers : str (default="numbers")
    The language-specific triples consisting of language id (numeric),
    sound class string (one character only), and prosodic string (one
    character only). Usually, numbers are automatically created from
    the columns "classes", "prostrings", and "langid", but you can
    also provide them in your data.
langid : str (default="langid")
    Name of the column that contains a numerical language identifier,
    needed to produce the language-specific character triples
    ("numbers"). Unless specified explicitly, this is automatically
    created.
prostrings : str (default="prostrings")
    Name of the column containing prosodic strings (see
    :evobib:`List2014d` for more details) of the segmented
    transcriptions, containing one character per prosodic string.
    Prostrings add a contextual component to phonetic sequences. They
    are automatically created, but can likewise be submitted from the
    initial data.
weights : str (default="weights")
    The name of the column which stores the individual gap-weights for
    each sequence. Gap weights are positive floats for each segment in
    a string, which modify the gap opening penalty during alignment.
tokenize : function (default=ipa2tokens)
    The function which should be used to tokenize the entries in the
    column storing the transcriptions in case no segmentation is
    provided by the user.
get_prostring : function (default=prosodic_string)
    The function which should be used to create prosodic strings from
    the segmented transcription data. If you want to completely ignore
    prosodic strings in LexStat calculations, you could just pass the
    following function::
>>> lex = LexStat('inputfile.tsv', get_prostring=lambda x: ["x" for y in x])
Attributes
----------
pairs : dict
    A dictionary with tuples of language names as key and indices as
    value, pointing to unique combinations of words with the same
    meaning in all language pairs.
model : :py:class:`~lingpy.data.model.Model`
    The sound class model instance which serves to convert the
    phonetic data into sound classes.
chars : list
    A list of all unique language-specific character types in the
    instantiated LexStat object. The characters in this list consist
    of
    * the language identifier (numeric, referenced as "langid" as a
      default, but customizable via the keyword "langid"),
    * the sound class symbol for the respective IPA transcription
      value, and
    * the prosodic class value.
    All values are represented in the above order as one string,
    separated by a dot. Gaps are also included in this collection.
    They are traditionally represented as "X" for the sound class and
    "-" for the prosodic string.
rchars : list
    A list containing all unique character types across languages. In
    contrast to the chars-attribute, the "rchars" (raw chars) do not
    contain the language identifier, thus they only consist of two
    values, separated by a dot, namely the sound class symbol and the
    prosodic class value.
scorer : dict
    A collection of
    :py:class:`~lingpy.algorithm.cython.misc.ScoreDict` objects, which
    are used to score the strings. LexStat distinguishes two different
    scoring functions:
    * rscorer: A "raw" scorer that is not language-specific and
      consists only of sound class values and prosodic string values.
      This scorer is traditionally used to carry out the first
      alignment in order to calculate the language-specific scorer. It
      is directly accessible as an attribute of the LexStat class
      (:py:class:`~lingpy.compare.lexstat.LexStat.rscorer`). The
      characters which constitute the values in this scorer are
      accessible via the "rchars" attribute of each LexStat class.
    * bscorer: The language-specific scorer. This scorer is made of
      unique language-specific characters. These are accessible via
      the "chars" attribute of each LexStat class. Like the "rscorer",
      the "bscorer" can also be accessed directly as an attribute of
      the LexStat class
      (:py:class:`~lingpy.compare.lexstat.LexStat.bscorer`).
Notes
-----
Instantiating this class does not require many parameters. However,
the user may modify its behaviour by providing additional attributes
in the input file.
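Examples
--------
A minimal usage sketch, assuming the ``KSL.qlc`` test file shipped
with LingPy; the reduced ``transform`` dictionary shown here is a
hypothetical illustration of merging the 11 prosodic contexts into the
5 values listed above, not necessarily the package default::

    >>> from lingpy import LexStat
    >>> from lingpy.tests.util import test_data
    >>> lex = LexStat(test_data('KSL.qlc'),
    ...     transform={'A': 'C', 'B': 'C', 'C': 'C', 'L': 'c',
    ...         'M': 'c', 'N': 'c', 'X': 'V', 'Y': 'V', 'Z': 'V',
    ...         'T': 'T', '_': '_'})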
""" "model": rcParams['sca'], "merge_vowels": rcParams['merge_vowels'], 'transform': rcParams['lexstat_transform'], "check": False, "apply_checks": False, "defaults": False, "no_bscorer": False, "errors": "errors.log", "expand_nasals": False, "segments": "tokens", "numbers": "numbers", "classes": "classes", "transcription": "ipa", "prostrings": "prostrings", "weights": "weights", "sonars": "sonars", "langid": "langid", "duplicates": "duplicates", "tokenize": ipa2tokens, "get_prostring": prosodic_string, "row": "concept", "col": "doculect", "conf": None, 'cldf': False }
# make segments, numbers and classes persistent across classes
else:
# set the lexstat stamp
util.PROG)
# initialize the wordlist
self, filename, row=kw['row'], col=kw['col'], conf=kw['conf'])
self._transcription in self.header
# create tokens if they are missing
self._segments, self._transcription, kw['tokenize'],
merge_vowels=kw['merge_vowels'], expand_nasals=kw['expand_nasals'])
# add a debug procedure for tokens
[(key, self[key, self._segments]) for key in self], cldf=kw['cldf']))
key, msg, ' '.join(line)))
"There were errors in the input data - exclude them?"):
'tsv', filename=self.filename + '_cleaned', subset=True,
rows={"ID": "not in " + str([i[0] for i in errors])})
# load the data in a new LexStat instance and copy the __dict__
else:
# sonority profiles
self._sonars, self._segments,
lambda x: [int(i) for i in tokens2class(
    x, rcParams['art'], stress=rcParams['stress'],
    cldf=self._cldf)])
self._prostrings, self._sonars, lambda x: kw['get_prostring'](x))
# get sound class strings
self._classes, self._segments,
lambda x: ''.join(tokens2class(
    x, kw["model"], cldf=self._cldf, stress=rcParams['stress'])))
# create IDs for the languages
self.cols, [str(i + 1) for i in range(self.width)]))
self._langid, self._col_name, lambda x: transform[x])
# get the numbers for all strings
# change the discriminative potential of the sound-class string
# tuples; note that this is still WIP, we have to tweak around with
# this in order to find an optimum for the calculation
self._numbers,
self._langid + ',' + self._classes + ',' + self._prostrings,
lambda x, y: [charstring(x[y[0]], a, self._transform[b])
              for a, b in zip(x[y[1]], x[y[2]])])
# check for weights
self._weights, self._prostrings, lambda x: prosodic_weights(x))
# check for duplicates
# first, check for item 'words' in data; if not given, create it
self._transcription, self._segments, lambda x: ''.join(x))
# add information regarding vowels in the data based on the
# transformation, which is important for the calculation of the
# v-scale in lexstat.get_scorer
self._transform[v] for v in 'XYZT_']))) \
    if hasattr(self, '_transform') else 'VT_'
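# A minimal sketch (helper name hypothetical) of how the
# language-specific "number" triples above are composed, assuming
# charstring() joins language id, sound-class symbol, and prosodic
# class with dots:
def _charstring_sketch(langid, char='X', cls='-'):
    # e.g. _charstring_sketch(1, 'K', 'C') -> '1.K.C'
    return '{0}.{1}.{2}'.format(langid, char, cls)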
# create an index
col=taxon, entry=self._numbers, flat=True):
set(char.split('.', 1)[1] for char in self.chars)) + \
    [charstring(i + 1) for i in range(self.width)]
rcParams['lexstat_bad_chars_limit']:
"{0:.0f}% of the unique characters in your word list are not "
"recognized by {1}. You should set check=True!".format(
    100 * len(self.bad_chars) / len(self.chars), util.PROG))
# create a scoring dictionary
self.chars, self.model)
self.rchars, self.model)
# make the language pairs
enumerate(self.cols)):
''.join(self[idxA, self._segments]), taxonA,
''.join(self[idxB, self._segments]), taxonB)
""" Method allows quick access to the data by passing the integer key.
Notes ----- In contrast to the basic wordlist and parser classes, the LexStat wordlist further allows to access item pairs by passing a tuple consisting of two pairs of an index with its corresponding column name.
Examples
--------
Load LingPy and the test_data function to get access to the test data
files shipped with LingPy::

    >>> from lingpy import *
    >>> from lingpy.tests.util import test_data

Instantiate a LexStat object::

    >>> lex = LexStat(test_data('KSL.qlc'))

Retrieve the IPA column for line 1 and line 2 in the data::

    >>> lex[(1, 'ipa'), (2, 'ipa')]
"""
self._data[idx[0][0]][self._header[self._alias[idx[1]]]],
self._data[idx[0][1]][self._header[self._alias[idx[1]]]])
"""Helper method defines how words are aligned to retrieve distance \ scores""" self[x, self._numbers], self[y, self._numbers], [self.cscorer[charstring(self[y, 'langid']), n] for n in self[x, self._numbers]], [self.cscorer[charstring(self[x, 'langid']), n] for n in self[y, self._numbers]], self[x, self._prostrings], self[y, self._prostrings], 1, kw['scale'], kw['factor'], self.cscorer, kw['mode'], kw['restricted_chars'], 1)[2]
[n.split('.', 1)[1] for n in self[x, self._numbers]], [n.split('.', 1)[1] for n in self[y, self._numbers]], self[x, self._weights], self[y, self._weights], self[x, self._prostrings], self[y, self._prostrings], kw['gop'], kw['scale'], kw['factor'], self.rscorer, kw['mode'], kw['restricted_chars'], 1)[2]
self[x, entry], self[y, entry], True, kw['restriction'])
self[x, self._segments], self[y, self._segments])
self[x, 'user_tokens'], self[y, 'user_tokens'], kw['gop'], kw['scale'], kw['external_scorer'], 'overlap', True)[2]
dict(zip(
    ('lexstat', 'sca', 'edit-dist', 'turchin', 'custom'),
    (lexstat_align, sca_align, edit_align, turchin_align,
     custom_align)))[method]
"""Helper method for alignment operations""" x, y, method=method, distance=True, return_distance=kw['return_distance'], pprint=False, mode=kw['mode'], scale=kw['scale'], factor=kw['factor'], gop=kw['gop'])
self[x, self._segments], self[y, self._segments], normalized=kw['normalized'])
dict(zip(
    ('lexstat', 'sca', 'edit-dist'),
    (base_align, base_align, edit_align)))[method]
"""Helper method for flat clustering in cognate detection."""
y, x, list(range(len(x))), max_steps=kw['max_steps'], inflation=kw['inflation'], expansion=kw['expansion'], add_self_loops=kw['add_self_loops'], logs=kw['mcl_logs'], revert=True)
y, x, list(range(len(x))), revert=True)
y, x, list(range(len(x))), revert=True, fuzzy=False, matrix_type=kw['matrix_type'], link_threshold=kw['link_threshold'])
dict(zip(
    ('single', 'upgma', 'complete', 'ward', 'mcl', 'infomap',
     'link_clustering'),
    (linkage, linkage, linkage, linkage, mcl, infomap, lc)))[method]
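# A minimal sketch of the name-to-function dispatch pattern used in
# the three helpers above (function names here are illustrative only):
def _dispatch_sketch(method):
    def upgma(matrix):
        return 'flat upgma clustering of %d rows' % len(matrix)
    def single(matrix):
        return 'flat single-linkage clustering of %d rows' % len(matrix)
    # zip the method names with their implementations and look up
    return dict(zip(('upgma', 'single'), (upgma, single)))[method]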
""" Function creates a specific subset of all word pairs.
Parameters
----------
sublist : list
    A list which contains those items which should be considered for
    the subset creation, for example, a list of concepts.
ref : string (default="concept")
    The reference point to compare the given sublist.

Notes
-----
This function can be used to consider only a smaller part of word
pairs when creating a scorer. Normally, all words are compared, but
defining a subset allows to compare only those belonging to a specific
concept list (e.g. a Swadesh list).
"""
pair for pair in self.pairs[tA, tB]
    if self[pair, ref][0] in sublist]
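# A hedged usage sketch, assuming this method is exposed as
# LexStat.get_subset() and that the wordlist has a "concept" column:
def _subset_sketch(lex):
    # restrict scorer creation to word pairs from a short concept list
    lex.get_subset(['hand', 'eye', 'water'], ref='concept')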
""" Use alignments to get a correspondences statistics. """ cluster_method='upgma', factor=rcParams['align_factor'], gop=rcParams['align_gop'], modes=rcParams['lexstat_modes'], preprocessing=False, preprocessing_method=rcParams['lexstat_preprocessing_method'], preprocessing_threshold=rcParams[ 'lexstat_preprocessing_threshold'], ref='scaid', restricted_chars=rcParams['restricted_chars'], threshold=rcParams['lexstat_scoring_threshold'], subset=False)
method=kw['preprocessing_method'], threshold=kw['preprocessing_threshold'], gop=kw['gop'], cluster_method=kw['cluster_method'], ref=kw['ref'])
desc='CORRESPONDENCE CALCULATION', total=self.width ** 2 / 2) as pb: enumerate(self.cols)): tA, tB))
pair for pair in pairs if pair in self.subsets[tA, tB]]
# threshold and preprocessing: make sure the threshold is different
# from the preprocessing threshold when preprocessing is set to False
if self[pair, kw['ref']][0] == self[pair, kw['ref']][1]]
threshold = 10.0
else:
threshold, [self[pair, self._numbers] for pair in pairs], [self[pair, self._weights] for pair in pairs], [self[pair, self._prostrings] for pair in pairs], gop, scale, kw['factor'], self.bscorer, mode, kw['restricted_chars'])
# change representation of gaps
# XXX check for bias XXX
""" Return the aligned results of randomly aligned sequences. """ modes=rcParams['lexstat_modes'], factor=rcParams['align_factor'], restricted_chars=rcParams['restricted_chars'], runs=rcParams['lexstat_runs'], rands=rcParams['lexstat_rands'], limit=rcParams['lexstat_limit'], method=rcParams['lexstat_scoring_method'])
# determine the mode
else 'shuffle'
# get a random distribution for all pairs
[(i, j) for i in range(kw['rands']) for j in range(kw['rands'])],
kw['runs'])
desc='SEQUENCE GENERATION', total=len(self.cols)) as progress:
col=taxon, entry=self._prostrings, flat=True)
else:
"Could not generate enough distinct words for the random "
"distribution. Will expand automatically.")
cldf=self._cldf) '{0}.{1}'.format(c, p) for c, p in zip( cls, [self._transform[pr] for pr in pros[taxon][-1]] )])
desc='RANDOM CORRESPONDENCE CALCULATION', total=tasks) as progress: enumerate(self.cols)): "Calculating random alignments" " for pair {0}/{1}.".format(tA, tB) ) 10.0, [(seqs[tA][x], seqs[tB][y]) for x, y in sample], [(weights[tA][x], weights[tB][y]) for x, y in sample], [(pros[tA][x], pros[tB][y]) for x, y in sample], gop, scale, kw['factor'], self.rscorer, mode, kw['restricted_chars'])
# change representation of gaps
# get the correspondence count
# XXX check XXX
* len(self.pairs[tA, tB]) / runs
# check for gaps
# use shuffle approach otherwise
else:
desc='RANDOM CORRESPONDENCE CALCULATION', total=tasks) as progress:
enumerate(self.cols)):
"Calculating random alignments"
" for pair {0}/{1}.".format(tA, tB))
# get the number pairs etc.
self[pair, self._numbers] for pair in self.pairs[tA, tB]]
self[pair, self._weights] for pair in self.pairs[tA, tB]]
self[pair, self._prostrings] for pair in self.pairs[tA, tB]]
(x, y) for x in range(len(numbers)) for y in range(len(numbers))]
10.0, [( numbers[s[0]][0], numbers[s[1]][1]) for s in sample], [(gops[s[0]][0], gops[s[1]][1]) for s in sample], [( prostrings[s[0]][0], prostrings[s[1]][1]) for s in sample], gop, scale, kw['factor'], self.bscorer, mode, kw['restricted_chars'])
# change representation of gaps
# get the correspondence count
# XXX check XXX
* len(self.pairs[tA, tB]) / runs
# check for gaps
""" Create a scoring function based on sound correspondences.
Parameters
----------
method : str (default='shuffle')
    Select between "markov", for automatically generated random
    strings, and "shuffle", for random strings taken directly from the
    data.
ratio : tuple (default=(3, 2))
    Define the ratio between derived and original score for
    sound-matches.
vscale : float (default=0.5)
    Define a scaling factor for vowels, in order to decrease their
    score in the calculations.
runs : int (default=1000)
    Choose the number of random runs that shall be made in order to
    derive the random distribution.
threshold : float (default=0.7)
    The threshold which is used to select those words that are
    compared in order to derive the attested distribution.
modes : list (default=[("global", -2, 0.5), ("local", -1, 0.5)])
    The modes which are used in order to derive the distributions from
    pairwise alignments.
factor : float (default=0.3)
    The scaling factor for sound segments with identical prosodic
    environment.
force : bool (default=False)
    Force recalculation of an existing distribution.
preprocessing : bool (default=False)
    Select whether SCA-analysis shall be used to derive a preliminary
    set of cognates from which the attested distribution shall be
    derived.
rands : int (default=1000)
    If "method" is set to "markov", this parameter defines the number
    of strings to produce for the calculation of the random
    distribution.
limit : int (default=10000)
    If "method" is set to "markov", this parameter defines the limit
    above which no more search for unique strings will be carried out.
cluster_method : {"upgma", "single", "complete"} (default="upgma")
    Select the method to be used for the calculation of cognates in
    the preprocessing phase, if "preprocessing" is set to c{True}.
gop : int (default=-2)
    If "preprocessing" is selected, define the gap opening penalty for
    the preprocessing calculation of cognates.
unattested : {int, float} (default=-5)
    If a pair of sounds is not attested in the data, but expected by
    the alignment algorithm that computes the expected distribution,
    the score would be -infinity. In order to smooth this behaviour
    and to reduce the strictness, we set a default negative value
    which does not necessarily need to be very low, since it may well
    be that we miss a potentially good pairing in the first runs of
    alignment analyses. Use this keyword to adjust this parameter.
unexpected : {int, float} (default=0.000001)
    If a pair is encountered in a given alignment but not expected
    according to the randomized alignments, the score would not be
    calculable, since we would have to divide by zero. For this
    reason, we set a very small constant by which the score is divided
    in this case. Note that this constant is only relevant in those
    cases where the shuffling procedure was not carried out long
    enough.
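Examples
--------
A minimal usage sketch (the number of runs is reduced here purely for
illustration)::

    >>> lex = LexStat(test_data('KSL.qlc'))
    >>> lex.get_scorer(runs=100)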
""" method=rcParams['lexstat_scoring_method'], ratio=rcParams['lexstat_ratio'], vscale=rcParams['lexstat_vscale'], runs=rcParams['lexstat_runs'], threshold=rcParams['lexstat_scoring_threshold'], modes=rcParams['lexstat_modes'], factor=rcParams['align_factor'], restricted_chars=rcParams['restricted_chars'], force=False, preprocessing=False, rands=rcParams['lexstat_rands'], limit=rcParams['lexstat_limit'], cluster_method=rcParams['lexstat_cluster_method'], gop=rcParams['align_gop'], preprocessing_threshold=rcParams[ 'lexstat_preprocessing_threshold'], preprocessing_method=rcParams['lexstat_preprocessing_method'], subset=False, defaults=False, unattested=-5, unexpected=0.00001 )
# get parameters and store them in a string
ratio=kw['ratio'], vscale=kw['vscale'], runs=kw['runs'],
scoring_threshold=kw['threshold'],
preprocessing_threshold=kw['preprocessing_threshold'],
modestring=':'.join(
    '{0}-{1}-{2:.2f}'.format(a, abs(b), c)
    for a, b, c in kw['modes']),
factor=kw['factor'], restricted_chars=kw['restricted_chars'],
method=kw['method'],
preprocessing='{0}:{1}:{2}'.format(
    kw['preprocessing'], kw['cluster_method'], kw['gop']),
unattested=kw['unattested'], unexpected=kw['unexpected'])
[
    '{ratio[0]}:{ratio[1]}',
    '{vscale:.2f}',
    '{runs}',
    '{scoring_threshold:.2f}',
    '{modestring}',
    '{factor:.2f}',
    '{restricted_chars}',
    '{method}',
    '{preprocessing}',
    '{preprocessing_threshold}',
    '{unexpected:.2f}',
    '{unattested:.2f}',
]).format(**params)
# check for existing attributes
"An identical scoring function has already been calculated, "
"force recalculation by setting 'force' to 'True'.")
# check for attribute
"An identical scoring function has already been "
"calculated, force recalculation by setting 'force'"
" to 'True'.")
else:
"A different scoring function has already been calculated, "
"overwriting previous settings.")
# store parameters
# get the correspondence distribution # get the random distribution
# get the average gop
# create the new scoring matrix
list(self.freqs[tA]) + [charstring(i + 1)],
list(self.freqs[tB]) + [charstring(j + 1)]):
(tA, tB), {}).get((charA, charB), False)
(tA, tB), {}).get((charA, charB), False)
# in the following, we follow the former LexStat protocol
else:
# elif not exp and not att:
# combine the scores
else:
# get the real score
/ sum(kw['ratio'])
# use the vowel scale
else:
""" Align all or some words of a given pair of languages.
Parameters
----------
idxA, idxB : {int, str}
    Use an integer to refer to the words by their unique internal ID,
    use language names to select all words for a given language.
method : {'lexstat', 'sca'}
    Define the method to be used for the alignment of the words.
mode : {'global', 'local', 'overlap', 'dialign'} (default='overlap')
    Select the mode for the alignment analysis.
gop : int (default=-2)
    If 'sca' is selected as a method, define the gap opening penalty.
scale : float (default=0.5)
    Select the scale for the gap extension penalty.
factor : float (default=0.3)
    Select the factor for extra scores for identical prosodic
    segments.
restricted_chars : str (default="T\_")
    Select the restricted chars (boundary markers) in the prosodic
    strings in order to enable secondary alignment.
distance : bool (default=True)
    If set to c{True}, return the distance instead of the similarity
    score.
pprint : bool (default=True)
    If set to c{True}, print the results to the terminal.
return_distance : bool (default=False)
    If set to c{True}, return the distance score; otherwise, nothing
    will be returned.
"""
method='lexstat', mode="overlap", scale=0.5, factor=0.3,
restricted_chars='_T', pprint=True, return_distance=False, gop=-2,
distance=True, defaults=False, return_raw=False)
self.get_dict(col=idxA[0])[idxA[1]],
self.get_dict(col=idxB[0])[idxB[1]]):
else:
else:
(idxA, concept), (idxB, concept), concept=None, **kw)
# assign the distance value
# get the language ids
idxA, self._numbers]]
idxB, self._numbers]]
else:
self[idxA, self._numbers], self[idxB, self._numbers], weightsA, weightsB, self[idxA, self._prostrings], self[idxB, self._prostrings], gop, kw['scale'], kw['factor'], scorer, kw['mode'], kw['restricted_chars'], distance)
# get a string of scores
else:
scorer[a, b]) for a, b in zip(scoreA, scoreB)]
else:
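# A hedged usage sketch for align_pairs() (the word IDs 1 and 2 are
# placeholders and must be valid IDs in the loaded wordlist):
def _align_pairs_sketch(lex):
    # print the alignment of two words to the terminal
    lex.align_pairs(1, 2, method='sca', pprint=True)
    # or silently retrieve the distance score instead
    return lex.align_pairs(1, 2, method='sca', pprint=False,
                           return_distance=True)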
self, concept=False, method='sca', scale=0.5, factor=0.3,
restricted_chars='_T', mode='overlap', gop=-2, restriction='',
**keywords):
"""
Calculate alignment matrices.

Notes
-----
This is an iterator object which yields the indices of a given
concept, the matrix, and the concept.
"""
# currently, there are no defaults XXX
defaults=False,
external_scorer=False,  # external scoring function
)
method, scale=scale, factor=factor,
restricted_chars=restricted_chars, mode=mode, gop=gop,
restriction=restriction, external_scorer=kw['external_scorer'])
"Encountered Zero-Division for the comparison of "
"{0} and {1}".format(
    ''.join(self[idxA, self._segments]),
    ''.join(self[idxB, self._segments])))
else:
self, method='sca', cluster_method='upgma', threshold=0.3,
scale=0.5, factor=0.3, restricted_chars='_T', mode='overlap',
gop=-2, restriction='', ref='', external_function=None,
**keywords):
"""
Function for flat clustering of words into cognate sets.

Parameters
----------
method : {'sca', 'lexstat', 'edit-dist', 'turchin'} (default='sca')
    Select the method that shall be used for the calculation.
cluster_method : {'upgma', 'single', 'complete', 'mcl'} (default='upgma')
    Select the cluster method. 'upgma' (:evobib:`Sokal1958`) refers to
    average linkage clustering, 'mcl' refers to the "Markov Clustering
    Algorithm" (:evobib:`Dongen2000`).
threshold : float (default=0.3)
    Select the threshold for the cluster approach. If set to c{False},
    an automatic threshold will be estimated from the average distance
    of unrelated sequences (use with care).
scale : float (default=0.5)
    Select the scale for the gap extension penalty.
factor : float (default=0.3)
    Select the factor for extra scores for identical prosodic
    segments.
restricted_chars : str (default="T\_")
    Select the restricted chars (boundary markers) in the prosodic
    strings in order to enable secondary alignment.
mode : {'global', 'local', 'overlap', 'dialign'} (default='overlap')
    Select the mode for the alignment analysis.
verbose : bool (default=False)
    Define whether verbose output should be used or not.
gop : int (default=-2)
    If 'sca' is selected as a method, define the gap opening penalty.
restriction : {'cv'} (default="")
    Specify the restriction for calculations using the edit-distance.
    Currently, only "cv" is supported. If *edit-dist* is selected as
    *method* and *restriction* is set to *cv*, consonant-vowel matches
    will be prohibited in the calculations and the edit distance will
    be normalized by the length of the alignment rather than the
    length of the longest sequence, as described in
    :evobib:`Heeringa2006`.
inflation : {int, float} (default=2)
    Specify the inflation parameter for the use of the MCL algorithm.
expansion : int (default=2)
    Specify the expansion parameter for the use of the MCL algorithm.
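Examples
--------
A minimal usage sketch (threshold and reference column are chosen for
illustration only)::

    >>> lex.cluster(method='sca', threshold=0.45, ref='scaid')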
""" inflation=2, expansion=2, max_steps=1000, add_self_loops=True, guess_threshold=False, gt_trange=(0.4, 0.6, 0.02), mcl_logs=lambda x: -np.log2((1 - x) ** 2), gt_mode='average', matrix_type='distances', link_threshold=False, _return_matrix=False, # help function for test purposes defaults=False, external_scorer=False, # external scoring dictionary )
# check for parameters and add clustering, in order to make sure
# that analyses are not repeated
method, cluster_method, threshold)
'lexstat', 'sca', 'turchin', 'edit-dist', 'custom', 'infomap', 'link_clustering']:
# set up clustering algorithm, first the simple basics
else:
# make a dictionary that stores the clusters for later update
# create a matrix iterator
method=method, scale=scale, factor=factor,
restricted_chars=restricted_chars, mode=mode, gop=gop,
restriction=restriction, **kw)
# check for full consideration of basic threshold
clustering.best_threshold(m, kw['gt_trange']))
# new method for threshold estimation based on calculating
# approximate random distributions of similarities for each sequence
method, restricted_chars=restricted_chars, mode=mode, scale=scale,
factor=factor, gop=gop, return_distance=True)
desc='THRESHOLD DETERMINATION',
total=len(self.pairs) - len(self.cols)) as progress:
0, len(pairs) - 1)][1]) for i in range(len(pairs) // 20 or 5)]
desc='SEQUENCE CLUSTERING', total=len(self.rows)) as progress:
# check for keyword to guess the threshold
# FIXME: considering new function here JML
# elif kw['guess_threshold'] and kw['gt_mode'] == 'nullditem':
#     pass
else:
# specific clustering for fuzzy methods, currently not yet supported
# if cluster_method in ['fuzzy']:  # ['link_communities','lc','lcl']:
#     clusters = [[d + k for d in c[i]] for i in range(len(matrix))]
#     tests = []
#     for clrx in clusters:
#         for x in clrx:
#             tests += [x]
#     k = max(tests)
#     for idxA, idxB in zip(indices, clusters):
#         clr[idxA] = idxB
# else:
if 1:
# extract the clusters
# reassign the "k" value
# add values to cluster dictionary
'turchin', 'lexstat', 'sca', 'custom'] else 'editid'
ref, clr, util.identity, override=kw.get('override', False))
# assign thresholds to parameters
self, method, mode, scale, factor, gop, sample,
edit_dist_normalized):
"""
Parameters
----------
sample : callable
    Callable returning an iterator of pairs sampled from the list of
    pairs passed as its sole argument.
edit_dist_normalized : bool
    Whether edit_dist should be normalized.

Returns
-------
Generator of lists of distances for sampled pairs per taxa pair.
"""
method, distance=True, return_distance=True, pprint=False,
mode=mode, scale=scale, factor=factor, gop=gop,
normalized=edit_dist_normalized)
self, method='lexstat', runs=100, mode='overlap', gop=-2,
scale=0.5, factor=0.3, restricted_chars='T\_'):
"""
Method calculates random scores for unrelated words in a dataset.

Parameters
----------
method : {'sca', 'lexstat', 'edit-dist', 'turchin'} (default='lexstat')
    Select the method that shall be used for the calculation.
runs : int (default=100)
    Select the number of random alignments for each language pair.
mode : {'global', 'local', 'overlap', 'dialign'} (default='overlap')
    Select the mode for the alignment analysis.
gop : int (default=-2)
    If 'sca' is selected as a method, define the gap opening penalty.
scale : float (default=0.5)
    Select the scale for the gap extension penalty.
factor : float (default=0.3)
    Select the factor for extra scores for identical prosodic
    segments.
restricted_chars : str (default="T\_")
    Select the restricted chars (boundary markers) in the prosodic
    strings in order to enable secondary alignment.

Returns
-------
D : c{numpy.array}
    An array with all distances calculated for each sequence pair.
"""
[(x, y) for x in range(len(pairs)) for y in range(len(pairs))],
len(pairs))
method, mode, scale, factor, gop, sample, False):
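# A hedged usage sketch, assuming this method is exposed as
# LexStat.get_random_distances(), matching the docstring above:
def _random_distances_sketch(lex):
    # distances of randomly re-paired, hence presumably unrelated, words
    return lex.get_random_distances(method='sca', runs=100)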
self, method='sca', mode='overlap', gop=-2, scale=0.5,
factor=0.3, restricted_chars='T\_', aggregate=True):
"""
Method calculates different distance estimates for language pairs.

Parameters
----------
method : {'sca', 'lexstat', 'edit-dist', 'turchin'} (default='sca')
    Select the method that shall be used for the calculation.
runs : int (default=100)
    Select the number of random alignments for each language pair.
mode : {'global', 'local', 'overlap', 'dialign'} (default='overlap')
    Select the mode for the alignment analysis.
gop : int (default=-2)
    If 'sca' is selected as a method, define the gap opening penalty.
scale : float (default=0.5)
    Select the scale for the gap extension penalty.
factor : float (default=0.3)
    Select the factor for extra scores for identical prosodic
    segments.
restricted_chars : str (default="T\_")
    Select the restricted chars (boundary markers) in the prosodic
    strings in order to enable secondary alignment.
aggregate : bool (default=True)
    Return aggregated distances in form of a distance matrix for all
    taxa in the data.

Returns
-------
D : c{numpy.array}
    An array with all distances calculated for each sequence pair.
"""
method, mode, scale, factor, gop, util.identity, True):
else:
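# A hedged usage sketch, assuming this method is exposed as
# LexStat.get_distances(), matching the docstring above:
def _distances_sketch(lex):
    # aggregate pairwise word distances into a taxa-by-taxa matrix
    return lex.get_distances(method='sca', aggregate=True)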
""" Computes the frequencies of a given wordlist.
Parameters
----------
ftype : str (default='sounds')
    The type of frequency which shall be calculated. Select between
    "sounds" (type-token frequencies of sounds), "wordlength" (average
    word length per taxon or in aggregated form), and "diversity" for
    the diversity index (requires that you have carried out cognate
    judgments; make sure to set the "ref" keyword to the column in
    which your cognates are).
ref : str (default="tokens")
    The reference column, with the column for "tokens" as a default.
    Make sure to modify this keyword in case you want to check the
    "diversity".
aggregated : bool (default=False)
    Determine whether frequencies should be calculated in an
    aggregated way, for all languages, or on a language-per-language
    basis.

Returns
-------
freqs : {dict, float}
    Depending on the datatype you chose, this returns either a
    dictionary containing the frequencies or a float indicating the
    ratio.
"""
(len(self) - self.height)
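# A hedged usage sketch, assuming this method is exposed as
# LexStat.get_frequencies(), matching the docstring above:
def _frequencies_sketch(lex):
    # type-token frequencies of sounds, per language
    sounds = lex.get_frequencies(ftype='sounds')
    # diversity index, assuming cognate IDs were stored in "scaid"
    return sounds, lex.get_frequencies(ftype='diversity', ref='scaid')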
""" Write data to file.
Parameters
----------
fileformat : {'tsv', 'tre', 'nwk', 'dst', 'taxa', 'starling', 'paps.nex', 'paps.csv'}
    The format that is written to file. This corresponds to the file
    extension, thus 'tsv' creates a file in tsv-format, 'dst' creates
    a file in Phylip-distance format, etc.
filename : str
    Specify the name of the output file (defaults to a filename that
    indicates the creation date).
subset : bool (default=False)
    If set to c{True}, return only a subset of the data. Which subset
    is specified in the keywords 'cols' and 'rows'.
cols : list
    If *subset* is set to c{True}, specify the columns that shall be
    written to the csv-file.
rows : dict
    If *subset* is set to c{True}, use a dictionary consisting of keys
    that specify a column and values that give a Python-statement in
    raw text, such as, e.g., "== 'hand'". The content of the specified
    column will then be checked against the statement passed in the
    dictionary, and if it is evaluated to c{True}, the respective row
    will be written to file.
ref : str
    Name of the column that contains the cognate IDs if 'starling' is
    chosen as an output format.
missing : {str, int} (default=0)
    If 'paps.nex' or 'paps.csv' is chosen as fileformat, this
    character will be inserted as an indicator of missing data.
tree_calc : {'neighbor', 'upgma'}
    If no tree has been calculated and 'tre' or 'nwk' is chosen as
    output format, the method that is used to calculate the tree.
threshold : float (default=0.6)
    The threshold that is used to carry out a flat cluster analysis if
    'groups' or 'cluster' is chosen as output format.
ignore : {list, "all"}
    Modifies the output format in "tsv" output and allows to ignore
    certain blocks in extended "tsv", like "msa", "taxa", "json",
    etc., which should be passed as a list. If you choose "all" as a
    plain string and not a list, this will ignore all additional
    blocks and output only plain "tsv".
prettify : bool (default=True)
    Inserts comment characters between concepts in the "tsv" file
    output format, which makes it easier to see blocks of words
    denoting the same concept. Switching this off will output the file
    in plain "tsv".
See also
--------
~lingpy.basic.wordlist.Wordlist.output
~lingpy.align.sca.Alignments.output
"""
return kw  # pragma: no cover
kw['filename'] + '.scorer',
scorer2str(kw.get('scorer', self.rscorer)))
else:
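# A hedged usage sketch for output() (the filename is illustrative):
def _output_sketch(lex):
    # write the analysis, including cognate IDs, to a plain TSV file
    lex.output('tsv', filename='lexstat_results', ignore='all',
               prettify=False)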