Coverage for lingpy/data/derive.py : 96%

# *-* coding: utf-8 *-*
"""
Module for the derivation of sound class models.

The module provides functions for the customized compilation of sound-class models. All models are defined in simple text files. In order to guarantee their quick access when loading the library, the models are compiled and stored in binary files.
"""
""" Function imports individually defined sound classes from a text file and creates a replacement dictionary from these sound classes. """
' // '.join(sorted(set(errors))), filename))
""" Function imports score trees for a given range of sound classes and converts them into a graph. """
""" Function returns all paths (_fop=find_all_paths) which connect two nodes in a network. """
""" Function finds the path connecting two nodes in a directed graph under the condition that the two nodes are connected either directly or by a common ancestor node. """
# first possibility: there is a direct path between the two nodes
# if nx.shortest_path(graph,start,end) != False:
# return nx.shortest_path(graph,start,end)
# else:
# except:
# second possibility: there is a direct path between the two nodes, but
# it starts from the other node
# if nx.shortest_path(graph,end,start) != False:
# return nx.shortest_path(graph,end,start)
# third possibility: there is no direct path between the nodes in
# either direction, but there is a path in an undirected graph
# here, we simply check whether, within all paths connecting the
# two nodes, there is a node which directly connects to both nodes
# (i.e. which is the ancestor of both nodes). If this is the case,
# the respective shortest path is what we are looking for.
and end in shortest_paths[node].keys():
else: return False
# fourth condition: there is no path connecting the nodes at all
else: else: else:
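The case analysis in the comments above can be sketched without any graph library. This is a minimal illustration, not the module's own implementation: the names `bfs_path` and `find_dir_path` are hypothetical, the graph is assumed to be a plain dictionary mapping each node to a list of its successors, and the sketch accepts any common ancestor rather than only a directly connected one.

```python
from collections import deque


def bfs_path(graph, start, end):
    """Return a shortest directed path from start to end, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == end:
            return path
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


def find_dir_path(graph, start, end):
    """Hypothetical sketch of the four cases described in the comments."""
    # first and second possibility: a directed path in either direction
    path = bfs_path(graph, start, end) or bfs_path(graph, end, start)
    if path:
        return path
    # third possibility: a common ancestor that reaches both nodes
    best = None
    for node in graph:
        to_start = bfs_path(graph, node, start)
        to_end = bfs_path(graph, node, end)
        if to_start and to_end:
            # walk up from start to the ancestor, then down to end
            candidate = to_start[::-1] + to_end[1:]
            if best is None or len(candidate) < len(best):
                best = candidate
    # fourth possibility: no connection at all
    return best if best else False
```

Returning `False` for unconnected nodes mirrors the `return False` visible in the source fragment; a caller can then fall back to a default score.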
""" Function returns the length of a path in a weighted graph. """
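As a rough illustration of what "length of a path in a weighted graph" means here, the sketch below sums the edge weights along a list of nodes. The function name `path_length` and the representation of weights as a dictionary keyed by node pairs are assumptions made for this example only.

```python
def path_length(weights, path):
    """Sum the edge weights along ``path``.

    ``weights`` maps (source, target) pairs to numbers. An edge may be
    stored in either direction, so both orientations are tried.
    """
    total = 0
    for a, b in zip(path, path[1:]):
        if (a, b) in weights:
            total += weights[(a, b)]
        else:
            total += weights[(b, a)]
    return total
```

A path consisting of a single node has length zero, since there is no edge to traverse.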
""" Function creates a scoring dictionary for individually defined sound classes and individually created scoring trees by counting the path length connecting all nodes and assigning different start weights for vowels and consonants. """
# the scoring dictionary which will be returned by the function
graph, _find_dir_path(graph, node1, node2)) # make sure that the distance doesn't exceed the default value.
# iterate over all nodes in the previously created graph of sound class
# transitions
# check, whether the key has already been created
# if not, create the key
# if the nodes are the same, assign them the values for
# vowel-vowel or consonant-consonant identity
# these values might be made changeable in later versions
# for vowels and glides, the same starting value is assumed
# make sure, that tones do not score
else:
# if the nodes are different, see, if there is a connection
# between them defined in the directed network
else:
# treat vowel-vowel and consonant-consonant matches
# differently
# for vowels and glides, the starting value to subtract the
# weighted pathlength from is the vowel-vowel-identity
# score
# make sure that the distance doesn't exceed the
# default value for vowel-vowel matches, which
# should be zero, if there is no connection in the
# path defined
# for consonants, the starting value is the
# consonant-consonant score
# make sure that the minimum value of C-C-matches is zero
else:
# make sure that tone-tone classes score with zero
# for vowel-consonant, vowel-glide and glide-consonant
# matches, the starting value is the vowel-vowel score (may
# also be changed in later versions)
else:
# make sure to exclude tones from all matchings in
# order to force the algorithm to align tones with
# tones or gaps and with nothing else
# matches of glides with different classes
# glides and vowels or glides and consonants
else: else:
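The comments above describe scores for two classes of the same type as a starting value minus the weighted path length, floored at zero. The following is a minimal sketch of that arithmetic only. The function name and the identity values in `IDENTITY` are hypothetical placeholders, and the treatment of tones, glides, and mixed vowel-consonant pairs is deliberately left out.

```python
# Hypothetical identity scores, used for illustration only.
IDENTITY = {"v": 5, "c": 10}


def same_type_score(kind, distance=None):
    """Score two classes of the same type ('v' or 'c').

    ``distance`` is the weighted path length between the classes, or
    None when no connecting path is defined in the transition graph.
    """
    start = IDENTITY[kind]
    if distance is None:
        # no connection defined: fall back to zero
        return 0
    # subtract the path length, but never drop below zero
    return max(0, start - distance)
```

With a distance of zero the function returns the identity score itself, which matches the case where both nodes are the same.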
# add the characters for gaps in the multiple alignment process
# note that gaps and gaps should be scored by zero according to Feng &
# Doolittle. So far I have scored them as -1, and scoring gaps as zero made
# the alignments worse, probably because most tests have been based
# on profiles. We probably need a very good gap score.
# missing data
# swaps
# specific values
score_dict[(node, 'X')] = 0
else:
# define the gaps
""" Function exports a scoring dictionary to a csv-file.

@todo: This function can be better ported to another file. """
letters = list(set([key[0] for key in score_dict.keys()]))
rows = [['+'] + letters]
for l1 in letters:
    rows.append([l1] + [str(score_dict[(l1, l2)]) for l2 in letters])
util.write_text_file('score_dict.csv', '\n'.join('\t'.join(row) for row in rows))
""" Function compiles customized sound-class models.
Parameters
----------
model : str
    A string indicating the name of the model which shall be created.
path : str
    A string indicating the path where the model-folder is stored.
Notes
-----
A model is defined by a folder placed in the :file:`data/models` directory of the LingPy package. The name of the folder reflects the name of the model. It contains three files: the file :file:`converter`, the file :file:`INFO`, and the optional file :file:`scorer`. The format requirements for these files are as follows:
:file:`INFO`
    The ``INFO``-file serves as a reference for a given sound-class model. It can contain arbitrary information (and also be empty). If one wants to define specific characteristics, like the ``source``, the ``compiler``, the ``date``, or a ``description`` of a given model, this can be done by employing a key-value structure in which the key is preceded by an ``@`` and followed by a colon and the value is written right next to the key in the same line, e.g.::
@source: Dolgopolsky (1986)
This information will then be read from the ``INFO`` file and rendered when printing the model to screen with help of the :py:func:`print` function.
:file:`converter`
    The ``converter`` file contains all sound classes which are matched with their respective sound values. Each line is reserved for one class, preceded by the key (preferably an ASCII-letter) representing the class::
    B : ɸ, β, f, p͡f, p͜f, ƀ
    E : ɛ, æ, ɜ, ɐ, ʌ, e, ᴇ, ə, ɘ, ɤ, è, é, ē, ě, ê, ɚ
    D : θ, ð, ŧ, þ, đ
    G : x, ɣ, χ
    ...
:file:`matrix`
    A scoring matrix indicating the alignment scores of all sound-class characters defined by the model. The scoring is structured as a simple tab-delimited text file. The first cell contains the character names, the following cells contain the scores in redundant form (with both triangles being filled)::
    B 10.0 -10.0 5.0 ...
    E -10.0 5.0 -10.0 ...
    F 5.0 -10.0 10.0 ...
    ...
:file:`scorer`
    The ``scorer`` file (which is optional) contains the graph of class-transitions which is used for the calculation of the scoring dictionary. Each class is listed in a separate line, followed by one of the symbols ``v``, ``c``, or ``t`` (indicating whether the class represents vowels, consonants, or tones), and by the classes it is directly connected to. The strength of this connection is indicated by digits (the smaller the value, the shorter the path between the classes)::
    A : v, E:1, O:1
    C : c, S:2
    B : c, W:2
    E : v, A:1, I:1
    D : c, S:2
    ...
The information in such a file is automatically converted into a scoring dictionary (see :evobib:`List2012b` for details).
Based on the information provided by the files, a dictionary for the conversion of IPA-characters to sound classes and a scoring dictionary are created and stored as a binary. The model can be loaded with help of the :py:class:`~lingpy.data.model.Model` class and used in the various classes and functions provided by the library.
See also
--------
lingpy.data.model.Model
compile_dvt
""" # get the path to the models
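The key-value convention of the ``INFO`` file described in the docstring can be read with a few lines of Python. This is a sketch under stated assumptions: `parse_info` is a hypothetical helper, not a LingPy function, and lines that do not start with ``@`` are simply ignored as free text.

```python
def parse_info(text):
    """Collect ``@key: value`` lines from the text of an INFO file."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        # only lines of the form "@key: value" carry metadata
        if line.startswith("@") and ":" in line:
            key, value = line[1:].split(":", 1)
            info[key.strip()] = value.strip()
    return info
```

For the example line from the docstring, `parse_info("@source: Dolgopolsky (1986)")` yields a dictionary with the single key `source`.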
# load the sound classes
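To make the ``converter`` format concrete, the sketch below turns such a file into a replacement dictionary that maps each sound to its class key. The name `parse_converter` is hypothetical and the error handling of the real loader (which collects and reports problems) is not reproduced.

```python
def parse_converter(text):
    """Map every sound listed in a converter file to its class key."""
    converter = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        cls, sounds = line.split(":", 1)
        for sound in sounds.split(","):
            sound = sound.strip()
            if sound:
                converter[sound] = cls.strip()
    return converter
```

Splitting on the first colon only keeps the class key separate from the comma-delimited list of sounds that follows it.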
# dump the data
# try to load the scoring function or the score tree
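The ``scorer`` format from the docstring can be parsed into a class-type table and a set of weighted, directed edges. This is an illustrative sketch: `parse_scorer` and the edge dictionary keyed by node pairs are assumptions for this example, and the module itself builds a graph object instead.

```python
def parse_scorer(text):
    """Return (kinds, edges) parsed from the text of a scorer file.

    ``kinds`` maps each class to 'v', 'c', or 't'; ``edges`` maps
    (source, target) pairs to the integer weight of the connection.
    """
    kinds, edges = {}, {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        cls, rest = line.split(":", 1)
        cls = cls.strip()
        fields = [f.strip() for f in rest.split(",") if f.strip()]
        if not fields:
            continue
        # the first field is the class type, the rest are connections
        kinds[cls] = fields[0]
        for field in fields[1:]:
            target, weight = field.split(":")
            edges[(cls, target.strip())] = int(weight)
    return kinds, edges
```

A smaller weight stands for a shorter path between two classes, so these values can be summed directly when measuring path lengths.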
# calculate the scoring dictionary
# make score_dict a ScoreDict instance range(len(chars))]
else:
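When a model ships a ready-made ``matrix`` file instead of a ``scorer`` file, the tab-delimited rows can be read into a pair-to-score dictionary. The helper below is a hypothetical sketch of that step and assumes that the column order equals the row order, as the redundant (fully filled) layout suggests.

```python
def parse_matrix(text):
    """Read a tab-delimited scoring matrix into (chars, scores)."""
    rows = [line.split("\t") for line in text.splitlines() if line.strip()]
    # the first cell of each row names the sound-class character
    chars = [row[0] for row in rows]
    scores = {}
    for row in rows:
        for char, cell in zip(chars, row[1:]):
            scores[(row[0], char)] = float(cell)
    return chars, scores
```

Because both triangles of the matrix are filled, `scores[(a, b)]` and `scores[(b, a)]` are both available without any mirroring step.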
""" Function compiles diacritics, vowels, and tones.
Notes
-----
Diacritics, vowels, and tones are defined in the :file:`data/models/dv/` directory of the LingPy package and automatically loaded when loading the LingPy library. The values are defined as the constants :py:obj:`rcParams['vowels']`, :py:obj:`rcParams['diacritics']`, and :py:obj:`rcParams['tones']`. Their core purpose is to guide the tokenization of IPA strings (cf. :py:func:`~lingpy.sequence.sound_classes.ipa2tokens`). In order to change the variables, one simply has to change the text files :file:`diacritics`, :file:`tones`, and :file:`vowels` in the :file:`data/models/dv` directory. The structure of these files is fairly simple: each line contains a vowel or a diacritic character, whereas diacritics are preceded by a dash.
See also
--------
lingpy.data.model.Model
lingpy.data.derive.compile_model
"""
# get the path to the models
else: file_path = path
# normalize stuff
# TODO: this is potentially dangerous and it is important to decide whether
# TODO: switching to NFD might not be a better choice
os.path.join(file_path, name), normalize='NFC').replace('\n', '')
else:
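The reading step for the :file:`vowels`, :file:`tones`, and :file:`diacritics` files can be sketched with the standard library alone. The helper `read_symbols` and its `strip_dash` flag are hypothetical; the sketch assumes that the leading dash in the diacritics file only serves as a carrier for the character and is dropped after reading.

```python
import unicodedata


def read_symbols(text, strip_dash=False):
    """Join the symbols of a one-per-line file into a single string."""
    symbols = []
    # NFC composes base letters with combining marks where possible
    for line in unicodedata.normalize("NFC", text).splitlines():
        line = line.strip()
        if strip_dash and line.startswith("-"):
            line = line[1:]
        if line:
            symbols.append(line)
    return "".join(symbols)
```

The choice between NFC and NFD matters for the result: under NFC a sequence such as `e` followed by a combining acute accent becomes the single character `é`, whereas NFD would keep the two code points apart.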