Coverage for lingpy/sequence/sound_classes.py : 99%

# -*- coding: utf-8 -*-
"""
Module provides various methods for the handling of sound classes.
"""
""" Tokenize IPA-encoded strings.
Parameters ----------
seq : str The input sequence that shall be tokenized.
diacritics : {str, None} (default=None) A string containing all diacritics which shall be considered in the respective analysis. When set to *None*, the default diacritic string will be used.
vowels : {str, None} (default=None) A string containing all vowel symbols which shall be considered in the respective analysis. When set to *None*, the default vowel string will be used.
tones : {str, None} (default=None) A string indicating all tone letter symbals which shall be considered in the respective analysis. When set to *None*, the default tone string will be used.
combiners : str (default="\u0361\u035c") A string with characters that are used to combine two separate characters (compare affricates such as t͡s).
breaks : str (default="-.") A string containing the characters that indicate that a new token starts right after them. These can be used to indicate that two consecutive vowels should not be treated as diphtongs or for diacritics that are put before the following letter.
merge_vowels : bool (default=False) Indicate, whether vowels should be merged into diphtongs (default=True), or whether each vowel symbol should be considered separately.
merge_geminates : bool (default=False) Indicate, whether identical symbols should be merged into one token, or rather be kept separate.
expand_nasals : bool (default=False)
semi_diacritics: str (default='') Indicate which symbols shall be treated as "semi-diacritics", that is, as symbols which can occur on their own, but which eventually, when preceded by a consonant, will form clusters with it. If you want to disable this features, just set the keyword to an empty string. clean_string : bool (default=False) Conduct a rough string-cleaning strategy by which all items between brackets are removed along with the brackets, and
Returns ------- tokens : list A list of IPA tokens.
Examples -------- >>> from lingpy import * >>> myseq = 't͡sɔyɡə' >>> ipa2tokens(myseq) ['t͡s', 'ɔy', 'ɡ', 'ə']
See also -------- tokens2class class2tokens """ # go for defaults breaks=rcParams['breaks'], combiners=rcParams['combiners'], diacritics=rcParams['diacritics'], expand_nasals=False, merge_geminates=True, merge_vowels=rcParams['merge_vowels'], semi_diacritics='', stress=rcParams['stress'], tones=rcParams['tones'], vowels=rcParams['vowels'], clean_sequence=False # add this later, not today XXX )
# check for pre-tokenized strings
else:
# create the list for the output
# set basic characteristics
# define nasals and nasal chars and semi_diacritics
# check for nasal stack and vowel environment
# check for breaks first, since they force us to start anew
# check for combiners next
# check for stress
# XXX be careful about the removal of the start-flag here, but it
# XXX seems to make sense so far!
# check for merge command
# check for nasals in NFC normalization and non-normalizable nasals
# check for weak diacritics and not tone and out[-1] not in nogos:
# check for diacritics
else:
# check for vowels
else:
# check for tones
else:
# consonants
else:
else:
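The branch comments above outline the tokenizer's control flow. As a rough illustration of how combiners and vowel merging interact, here is a deliberately simplified, self-contained sketch — the `toy_ipa2tokens` helper and its tiny vowel set are hypothetical and not lingpy's actual implementation:

```python
COMBINERS = "\u0361\u035c"
VOWELS = set("aeiouyɔəɛ")  # hypothetical mini vowel set, for illustration only

def toy_ipa2tokens(seq, merge_vowels=True):
    tokens = []
    for char in seq:
        if tokens and tokens[-1][-1] in COMBINERS:
            # previous token ended in a combiner: glue this char onto it
            tokens[-1] += char
        elif char in COMBINERS and tokens:
            # a combiner attaches itself to the preceding token
            tokens[-1] += char
        elif merge_vowels and tokens and char in VOWELS \
                and tokens[-1][-1] in VOWELS:
            # merge consecutive vowels into a diphthong
            tokens[-1] += char
        else:
            tokens.append(char)
    return tokens
```

With `merge_vowels=True` this reproduces the docstring example `['t͡s', 'ɔy', 'ɡ', 'ə']`; with `merge_vowels=False` the diphthong stays split, as described for the keyword above.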
""" Carry out a simple syllabification of a sequence, using sonority as a proxy.
Parameters ---------- output: {"flat", "breakpoints", "nested"} (default="flat") Define how to output the syllabification. Select between: * "flat": A syllable separator is introduced to mark the syllable boundaries * "breakpoins": A tuple consisting of indices that slice the original sequence into syllables is returned. * "nested": A nested list reflecting the syllable structure is returned.
sep : str (default="◦") Select your preferred syllable separator.
Notes -----
When analyzing the sequence, we start a new syllable in all cases where we reach a deepest point in the sonority hierarchy of the sonority profile of the sequence. When passing an aligned string to this function, the gaps will be ignored when computing boundaries, but later on re-introduced, if the alignment is passed in segmented form.
Returns -------
syllable : list Either a flat list containing a morpheme separator, or a nested list, reflecting the syllable structure, or a list of tuples containing the indices indicating where the input sequence should be sliced in order to split it into syllables.
""" "The output «{0}» you specified is not available.".format(output))
"sep": rcParams['morpheme_separator'], "gap": rcParams['gap_symbol'], "model": "art", "stress": rcParams['stress'], "diacritics": rcParams['diacritics'], "cldf": False }
# we assume we are dealing with tokens if the syllable is a list.
else:
# check whether our sequence is an alignment
# get the profile for the sequence
stress=kw['stress'], diacritics=kw['diacritics'])] + [0]
# get the pro-tokens
# get the char
# simple rule: we start a new syllable, if p2 is smaller than or equal to
# p1 and p3 is larger than p2
# don't break if we are in the initial and no vowel followed
# can be expanded to a general "vowel needs to follow" rule
# don't break if we are at the end of the word
else:
# break always if there's a tone
# get the char before, after
# control for break chars which are already there
else:
else:
# if we detected an alignment character in the string, we need to reparse # the data
else:
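The break rule sketched in the comments above (start a new syllable where the sonority profile reaches a local minimum, i.e. p2 <= p1 and p3 > p2) can be illustrated in isolation. The `toy_syllabify` helper below is a hypothetical simplification that takes precomputed integer sonority values rather than deriving them from lingpy's "art" model:

```python
def toy_syllabify(tokens, sonority, sep="+"):
    # pad with low-sonority virtual boundaries so that the word edges
    # never trigger a break on their own
    profile = [0] + list(sonority) + [0]
    out = [tokens[0]]
    for i in range(1, len(tokens)):
        # sonority of the previous, current, and following position
        p1, p2, p3 = profile[i], profile[i + 1], profile[i + 2]
        if p2 <= p1 and p3 > p2:
            out.append(sep)  # local minimum: new syllable starts here
        out.append(out.pop() if False else tokens[i]) if False else out.append(tokens[i])
    return out
```

For a CVCV word such as "papa" (sonorities 1, 7, 1, 7) this yields a break before the second consonant, while word-final consonants do not trigger a break — matching the "don't break at the end of the word" comment above.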
""" Helper function determines the points where to split a sequence. """ else:
""" Split a string into morphemes if it contains separators.
Notes ----- Function splits a list of tokens into subsequent lists of morphemes if the list contains morpheme separators. If no separators are found, but tonemarkers, it will still split the string according to the tones. If you want to avoid this behavior, set the keyword **split_on_tones** to False.
Parameters ---------- sep : str (default="◦") Select your morpheme separator. word_sep: str (default="_") Select your word separator.
Returns ------- morphemes : list A nested list of the original segments split into morphemes. """
"sep": rcParams['morpheme_separator'], "word_sep": rcParams['word_separator'], "word_seps": rcParams['word_separators'], "seps": rcParams['morpheme_separators'], "split_on_tones": True, "tone": "T" }
# check for other hints than the clean separators in the data
and '+' not in class_string and '_' not in class_string:
else:
else:
# check for bad examples
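The separator-based splitting described above reduces to cutting a token list at the marker symbols. A minimal sketch, assuming hard-coded "+" and "_" separators instead of the rcParams-driven defaults (the `toy_tokens2morphemes` name is hypothetical):

```python
def toy_tokens2morphemes(tokens, seps="+_"):
    morphemes, current = [], []
    for token in tokens:
        if token in seps:
            # separator reached: close the current morpheme
            if current:
                morphemes.append(current)
            current = []
        else:
            current.append(token)
    if current:
        morphemes.append(current)
    return morphemes
```

The tone-based fallback splitting mentioned in the Notes is deliberately left out of this sketch.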
""" Split the output of the syllabify method into subsets.
Notes ----- This is a simple helper function to deal with syllabified content. """
else: # reconsider deleting these lines, since they # may well confuse the algorithms and we should # better restrict all actions to but one syllable separator
""" Helper function for string output of ono-parsed words. """ x for x in ono]) out = [] for k in ono: out.append(k[0] or '-')
""" Carry out a rough onset-nucleus-offset parse of a word in IPA.
Notes ----- Method is an approximation and not supposed to do without flaws. It is, however, rather helpful in most instances. It defines a so far simple model in which 7 different contexts for each word are distinguished:
* "#": onset cluster in a word's initial * "C": onset cluster in a word's non-initial * "V": nucleus vowel in a word's initial syllable * "v": nucleus vowel in a word's non-initial and non-final syllable * ">": nucleus vowel in a word's final syllable * "c": offset cluster in a word's non-final syllable * "$": offset cluster in a word's final syllable
""" "sep": rcParams['morpheme_separator'], "gap": rcParams['gap_symbol'], "model": "art", "stress": rcParams['stress'], "diacritics": rcParams['diacritics'], "cldf": False } else:
# we take lists to restore internal tokenization
# correct parse by type
else:
else:
# linearize parse [XXX bad solution but too lazy to correct it in this
# stage @lingulist]
else:
ipa2tokens(
    seq,
    diacritics='*$~"',
    vowels='aeiouE3',
    tones='',
    combiners='',
    merge_vowels=merge_vowels
))
""" Convert a single token into a sound-class.
tokens : str A token (phonetic segment).
model : :py:class:`~lingpy.data.model.Model` A :py:class:`~lingpy.data.model.Model` object.
stress : str (default=rcParams['stress']) A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics']) A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbolds defined in ~lingpy.settings.rcParams.
cldf : bool (default=False) If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the `CLDF <http://cldf.clld.org>`_ specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the `CLTS <http://calc.digling.org/clts/>`_ initiative), in cases of insecurity of pronunciation, users can adopt a ```source/target``` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the `EDICTOR <http://edictor.digling.org>`_ tool.
Returns -------
sound_class : str A sound-class representation of the phonetic segment. If the segment cannot be resolved, the respective string will be rendered as "0" (zero).
See also -------- ipa2tokens class2tokens token2class
""" # check basic parameters
# change token if cldf is selected
# check whether model is passed as real model or as string
# check for stressed syllables # new character for missing data and spurious items return model[token[1]] else: else: # new character for missing data and spurious items else:
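At its core, the lookup described above is a dictionary access with a "0" fallback. The sketch below uses a hypothetical three-entry mini model and falls back to the first character of the token (the real function, as the fragment above shows, instead skips a leading stress mark via `token[1]`):

```python
TOY_MODEL = {'t': 'T', 'a': 'A', 'n': 'N'}  # hypothetical mini model

def toy_token2class(token, model=TOY_MODEL):
    try:
        return model[token]
    except KeyError:
        try:
            # strip a trailing diacritic and retry with the base character
            return model[token[0]]
        except (KeyError, IndexError):
            return '0'  # unresolvable segment
```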
""" Convert tokenized IPA strings into their respective class strings.
Parameters ----------
tokens : list A list of tokens as they are returned from :py:func:`ipa2tokens`.
model : :py:class:`~lingpy.data.model.Model` A :py:class:`~lingpy.data.model.Model` object.
stress : str (default=rcParams['stress']) A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics']) A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbolds defined in ~lingpy.settings.rcParams.
cldf : bool (default=False) If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the `CLDF <http://cldf.clld.org>`_ specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the `CLTS <http://calc.digling.org/clts/>`_ initiative), in cases of insecurity of pronunciation, users can adopt a ```source/target``` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the `EDICTOR <http://edictor.digling.org>`_ tool.
Returns -------
classes : list A sound-class representation of the tokenized IPA string in form of a list. If sound classes cannot be resolved, the respective string will be rendered as "0" (zero).
Notes ----- The function ~lingpy.sequence.sound_classes.token2class returns a "0" (zero) if the sound is not recognized by LingPy's sound class models. While an unknown sound in a longer sequence is no problem for alignment algorithms, we have some unwanted and often even unforeseeable behavior, if the sequence is completely unknown. For this reason, this function raises a ValueError, if a resulting sequence only contains unknown sounds.
Examples -------- >>> from lingpy import * >>> tokens = ipa2tokens('t͡sɔyɡə') >>> classes = tokens2class(tokens,'sca') >>> print(classes) CUKE
See also -------- ipa2tokens class2tokens token2class
""" # raise value error if input is not an iterable (tuple or list)
diacritics=diacritics, cldf=cldf))
""" Create a prosodic string of the sonority profile of a sequence.
Parameters ----------
seq : list A list of integers indicating the sonority of the tokens of the underlying sequence.
stress : str (default=rcParams['stress']) A string containing the stress symbols used in the analysis. Defaults to the stress as defined in ~lingpy.settings.rcParams.
diacritics : str (default=rcParams['diacritics']) A string containing diacritic symbols used in the analysis. Defaults to the diacritic symbolds defined in ~lingpy.settings.rcParams.
cldf : bool (default=False) If set to True, this will allow for a specific treatment of phonetic symbols which cannot be completely resolved (e.g., laryngeal h₂ in Indo-European). Following the `CLDF <http://cldf.clld.org>`_ specifications (in particular the specifications for writing transcriptions in segmented strings, as employed by the `CLTS <http://calc.digling.org/clts/>`_ initiative), in cases of insecurity of pronunciation, users can adopt a ```source/target``` style, where the source is the symbol used, e.g., in a reconstruction system, and the target is a proposed phonetic interpretation. This practice is also accepted by the `EDICTOR <http://edictor.digling.org>`_ tool.
Returns ------- prostring : string A prosodic string corresponding to the sonority profile of the underlying sequence.
See also: ---------
prosodic weights
Notes -----
A prosodic string is a sequence of specific characters which indicating their resprective prosodic context (see :evobib:`List2012` or :evobib:`List2012a` for a detailed description). In contrast to the previous model, the current implementation allows for a more fine-graded distinction between different prosodic segments. The current scheme distinguishes 9 prosodic positions:
* ``A``: sequence-initial consonant * ``B``: syllable-initial, non-sequence initial consonant in a context of ascending sonority * ``C``: non-syllable, non-initial consonant in ascending sonority context * ``L``: non-syllable-final consonant in descending environment * ``M``: syllable-final consonant in descending environment * ``N``: word-final consonant * ``X``: first vowel in a word * ``Y``: non-final vowel in a word * ``Z``: vowel occuring in the last position of a word * ``T``: tone * ``_``: word break
Examples -------- >>> prosodic_string(ipa2tokens('t͡sɔyɡə') 'AXBZ'
""" diacritics=rcParams['diacritics'])
# check for empty string passed
# check for the right string
# get the sonority profile
[int(t) for t in tokens2class(
    string, rcParams['art'],
    stress=keywords['stress'],
    diacritics=keywords['diacritics'],
    cldf=keywords['cldf'])] + [9]
else:
# check for multiple strings in string
# break the string into pieces
else:
# return the prostrings of the pieces recursively; note that the
# additional check whether x is True is necessitated by the fact that
# often errors occur in the coding, i.e. strings are given
# create the output values
# start iterating over relevant parts of the string
# get all three values
# check for vowel
# check for first
# check for last
else:
# check for tones
# check for descending position
# check for word final consonant
# check for word final vowel
else:
else:
# ascending
# check for syllable first
else:
else:
else:
# consonant peak
else:
# dummy for other stuff
else:
"Conversion to prosodic string failed due to a condition which was not "
"defined in the conversion, for details compare the numerical string "
"{0} with the profile string {1}".format(sstring, pstring))
"A": "C", "B": "C", "C": "C", "M": "C", "L": "C", "N": "C", "X": "V", "Y": "V", "Z": "V", "T": "T", "_": "_", }
"A": "C", "B": "C", "C": "C", "M": "c", "L": "c", "N": "c", "X": "V", "Y": "V", "Z": "v", "T": "T", "_": "_", }
else: "A": "#", "B": "C", "C": "C", "M": "c", "L": "c", "N": "$", "X": "V", "Y": "v", "Z": ">" }
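The first of the mappings above collapses the fine-grained A–Z scheme into a plain CV template; applying such a transform is a direct character-by-character lookup. The `to_cv` helper below is a hypothetical illustration of that step, not lingpy's own code:

```python
# collapse the fine-grained prosodic scheme into a CV template
CV = {"A": "C", "B": "C", "C": "C", "M": "C", "L": "C", "N": "C",
      "X": "V", "Y": "V", "Z": "V", "T": "T", "_": "_"}

def to_cv(prostring):
    # one lookup per prosodic symbol
    return "".join(CV[p] for p in prostring)
```

For the docstring example 'AXBZ' this yields 'CVCV'.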
""" Calculate prosodic weights for each position of a sequence.
Parameters ----------
prostring : string A prosodic string as it is returned by :py:func:`prosodic_string`. _transform : dict A dictionary that determines how prosodic strings should be transformed into prosodic weights. Use this dictionary to adjust the prosodic strings to your own user-defined prosodic weight schema.
Returns ------- weights : list A list of floats reflecting the modification of the weight for each position.
Notes -----
Prosodic weights are specific scaling factors which decrease or increase the gap score of a given segment in alignment analyses (see :evobib:`List2012` or :evobib:`List2012a` for a detailed description).
Examples -------- >>> from lingpy import * >>> prostring = '#vC>' >>> prosodic_weights(prostring) [2.0, 1.3, 1.5, 0.7]
See also -------- prosodic_string
""" # check for transformer
# default scale for tonal languages
{
    '#': 1.6,
    'V': 3.0,
    'C': 1.2,
    'c': 1.1,
    'v': 3.0,  # raise the cost for the gapping of vowels
    '<': 0.8,
    '$': 0.5,
    '>': 0.7,
    'T': 1.0,
    '_': 0.0,
    # new values for alternative prostrings
    'A': 1.6,   # initial
    'B': 1.3,   # syllable-initial
    'C': 1.2,   # ascending
    'L': 1.1,   # descending
    'M': 1.1,   # syllable-descending
    'N': 0.5,   # final
    'X': 3.0,   # vowel in initial syllable
    'Y': 3.0,   # vowel in non-final syllable
    'Z': 0.7,   # vowel in final syllable
    'T': 1.0,   # Tone
    '_': 0.0    # break character
}
# default scale for other languages
else:
{
    '#': 2.0,
    'V': 1.5,
    'C': 1.5,
    'c': 1.1,
    'v': 1.3,
    '<': 0.8,
    '$': 0.8,
    '>': 0.7,
    'T': 0.0,
    '_': 0.0,
    # new values for alternative prostrings
    'A': 2.0,   # initial
    'B': 1.75,  # syllable-initial
    'C': 1.5,   # ascending
    'L': 1.1,   # descending
    'M': 1.1,   # syllable-descending
    'N': 0.8,   # final
    'X': 1.5,   # vowel in initial syllable
    'Y': 1.3,   # vowel in non-final syllable
    'Z': 0.8,   # vowel in final syllable
    'T': 0.0,   # Tone
    '_': 0.0    # break character
}
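Once a scale is chosen, computing the weights is a per-symbol lookup. The sketch below reuses the non-tonal default scale listed above (the `toy_prosodic_weights` name is hypothetical; lingpy additionally handles the tonal scale and user-supplied transforms):

```python
# non-tonal default scale, as listed above
WEIGHTS = {'#': 2.0, 'V': 1.5, 'C': 1.5, 'c': 1.1, 'v': 1.3,
           '<': 0.8, '$': 0.8, '>': 0.7, 'T': 0.0, '_': 0.0}

def toy_prosodic_weights(prostring):
    # one scaling factor per prosodic position
    return [WEIGHTS[p] for p in prostring]
```

This reproduces the docstring example: '#vC>' maps to [2.0, 1.3, 1.5, 0.7].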
""" Turn aligned sound-class sequences into an aligned sequences of IPA tokens.
Parameters ----------
tokens : list The list of tokens corresponding to the unaligned IPA string.
classes : string or list The aligned class string.
gap_char : string (default="-") The character which indicates gaps in the output string.
local : bool (default=False) If set to *True* a local alignment with prefix and suffix can be converted.
Returns ------- alignment : list A list of tokens with gaps at the positions where they occured in the alignment of the class string.
See also -------- ipa2tokens tokens2class
Examples -------- >>> from lingpy import * >>> tokens = ipa2tokens('t͡sɔyɡə') >>> aligned_sequence = 'CU-KE' >>> print ', '.join(class2tokens(tokens,aligned_sequence)) t͡s, ɔy, -, ɡ, ə
""" else: # get the length of the prefix
# get the suffix
# get the substring
# start the loop
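The loop above essentially re-inserts gaps: it walks the aligned class string and emits the next unaligned token for every non-gap position. A minimal sketch of that idea (the `toy_class2tokens` name is hypothetical, and the `local` prefix/suffix handling is left out):

```python
def toy_class2tokens(tokens, aligned, gap_char='-'):
    # consume tokens one by one while walking the aligned class string
    out, queue = [], iter(tokens)
    for char in aligned:
        out.append(gap_char if char == gap_char else next(queue))
    return out
```

For tokens ['t', 'a', 'k'] and the aligned string 'CV-C' this yields ['t', 'a', '-', 'k'], mirroring the docstring example.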
""" Calculate the Percentage Identity (PID) score for aligned sequence pairs.
Parameters ----------
almA, almB : string or list The aligned sequences which can be either a string or a list.
mode : { 1, 2, 3, 4, 5 } Indicate which of the four possible PID scores described in :evobib:`Raghava2006` should be calculated, the fifth possibility is added for linguistic purposes:
1. identical positions / (aligned positions + internal gap positions),
2. identical positions / aligned positions,
3. identical positions / shortest sequence, or
4. identical positions / shortest sequence (including internal gap pos.)
5. identical positions / (aligned positions + 2 * number of gaps)
Returns -------
score : float The PID score of the given alignment as a floating point number between 0 and 1.
Notes -----
The PID score is a common measure for the diversity of a given alignment. The implementation employed by LingPy follows the description of :evobib:`Raghava2006` where four different variants of PID scores are distinguished. Essentially, the PID score is based on the comparison of identical residue pairs with the total number of residue pairs in a given alignment.
Examples -------- Load an alignment from the test suite.
>>> from lingpy import * >>> pairs = PSA(get_file('test.psa'))
Extract the alignments of the first aligned sequence pair.
>>> almA,almB,score = pairs.alignments[0]
Calculate the PID score of the alignment.
>>> pid(almA,almB) 0.44444444444444442
See also -------- lingpy.compare.Multiple.get_pid
.. todo:: change debug for ZeroDivisionError
"""
len([i for i in almA if i != '-']), len([i for i in almB if i != '-']))
len(''.join([i[0] for i in almA]).strip('-')), len(''.join([i[0] for i in almB]).strip('-')))
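Mode 2 from the list above (identical positions divided by aligned, i.e. non-gap-pair, positions) is the simplest of the five variants and can be sketched directly; `toy_pid` is a hypothetical simplification, not lingpy's `pid`:

```python
def toy_pid(almA, almB, gap='-'):
    # positions where both residues match (and are not gaps)
    identical = sum(1 for a, b in zip(almA, almB) if a == b and a != gap)
    # positions where neither sequence has a gap
    aligned = sum(1 for a, b in zip(almA, almB) if a != gap and b != gap)
    return identical / aligned if aligned else 0.0
```

The guard on `aligned` sidesteps the ZeroDivisionError flagged in the todo note above.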
""" Function checks whether tokens are given in a consistent input format. """ diacritics=rcParams['diacritics'], cldf=False) # check for conversion within the articulation-model cldf=keywords['cldf'], diacritics=keywords['diacritics'])
""" Function returns all possible n-grams of a given sequence.
Parameters ---------- sequence : list or str The sequence that shall be converted into it's ngram-representation.
Returns ------- out : list A list of all ngrams of the input word, sorted in decreasing order of length.
Examples -------- >>> get_all_ngrams('abcde') ['abcde', 'bcde', 'abcd', 'cde', 'abc', 'bcd', 'ab', 'de', 'cd', 'bc', 'a', 'e', 'b', 'd', 'c']
"""
# get the length of the word
# determine the starting point
# define the output list
# start the while loop # copy the sequence
# append the sequence to the output list
# loop over the new sequence
# increment i and decrement l
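The loop skeleton above builds the same result as the following compact sketch: collect every contiguous substring, longest first. Note that `toy_get_all_ngrams` is a hypothetical rewrite, and the ordering of same-length ngrams may differ from lingpy's own output shown in the docstring example:

```python
def toy_get_all_ngrams(sequence):
    out = []
    # iterate over lengths, longest first
    for length in range(len(sequence), 0, -1):
        for start in range(len(sequence) - length + 1):
            out.append(sequence[start:start + length])
    return out
```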
""" Convert sequence in IPA-sampa-format to IPA-unicode.
Notes ----- This function is based on code taken from Peter Kleiweg (http://www.let.rug.nl/~kleiweg/L04/devel/python/xsampa.html).
"""
return sequence.split(' ')
""" Convert a given sequence into a sequence of bigrams. """
""" Convert a given sequence into a sequence of trigrams. """
""" Convert a given sequence into a sequence of trigrams. """ zip( ['#', '#', '#'] + seq, ['#', '#'] + seq + ['$'], ['#'] + seq + ['$', '$'], seq + ['$', '$', '$'] ) )
""" convert a given sequence into a sequence of ngrams. """
""" Convert a given sequence into bigrams consisting of prosodic string symbols and the tokens of the original sequence. """ else:
'Item «{0}» does not have a counterpart!'.format(b))
sequence,
semi_diacritics='hsʃ̢ɕʂʐʑʒw',
merge_vowels=False,
segmentized=False,
rules=None,
ignore_brackets=True,
brackets=None,
split_entries=True,
splitters='/,;~',
preparse=None,
merge_geminates=True,
normalization_form="NFC"):
"""
Function exhaustively checks how well a sequence is understood by LingPy.

Parameters
----------
semi_diacritics : str
    Indicate characters which can occur both as "diacritics" (second part
    in a sound) or alone.
merge_vowels : bool (default=False)
    Indicate whether consecutive vowels should be merged.
segmentized : bool (default=False)
    Indicate whether the input string is already segmentized or not. If
    set to True, items in brackets can no longer be ignored.
rules : dict
    Replacement rules to be applied to a segmentized string.
ignore_brackets : bool
    If set to True, ignore all content within a given bracket.
brackets : dict
    A dictionary with opening brackets as key and closing brackets as
    values. Defaults to a pre-defined set of frequently occurring
    brackets.
split_entries : bool (default=True)
    Indicate whether multiple entries (with a comma etc.) should be split
    into separate entries.
splitters : str
    The characters which force the automatic splitting of an entry.
preparse : list
    List of tuples, giving simple replacement patterns (source and
    target), which are applied before every processing starts.

Returns
-------
cleaned_strings : list
    A list of cleaned strings which are segmented by space characters. If
    splitters are encountered, indicating that the entry contains two
    variants, the list will contain one for each element in a separate
    entry. If there are no splitters, the list has only size one.
"""
# replace white space if not indicated otherwise tuple)) else sequence] else: else:
# splitting needs to be done afterwards brackets='' if not ignore_brackets else brackets) else:
re.sub(r'\s+', '_', new_sequence.strip()), semi_diacritics=semi_diacritics, merge_vowels=merge_vowels, merge_geminates=merge_geminates)
"Return unicode codepoint(s) for a character set."