Coverage for lingpy/align/sca.py : 96%

# *-* coding: utf-8 *-*
"""
Basic module for pairwise and multiple sequence comparison.
The module consists of four classes which deal with pairwise and multiple
sequence comparison from the *sequence* and the *alignment* perspective. The
*sequence* perspective deals with unaligned sequences; the *alignment*
perspective deals with aligned sequences.
"""
from ..sequence.sound_classes import (
    ipa2tokens, token2class, tokens2class, class2tokens, prosodic_string,
    prosodic_weights, tokens2morphemes)
""" Basic class for carrying out multiple sequence alignment analyses.
    Parameters
    ----------
    infile : file
        A file in ``msq`` or ``msa`` format.
    merge_vowels : bool (default=True)
        Indicate whether neighboring vowels should be merged into
        diphthongs, or whether they should be kept separate during the
        analysis.
    comment : char (default='#')
        The comment character which, inserted at the beginning of a line,
        prevents that line from being read.
    normalize : bool (default=True)
        Normalize the alignment, that is, add gap characters to all
        sequences which are shorter than the longest sequence, and delete
        all columns of the alignment in which only gaps occur.
    Examples
    --------
    Get the path to a file from the test set.

    >>> from lingpy import *
    >>> path = rc("test_path") + 'harry.msq'

    Load the file into the Multiple class.

    >>> mult = Multiple(path)

    Carry out a progressive alignment analysis of the sequences.

    >>> mult.prog_align()

    Print the result to the screen:

    >>> print(mult)
    w o l - d e m o r t
    w a l - d e m a r -
    v - l a d i m i r -
    Notes
    -----
    There are two possible input formats for this class: the MSQ-format and
    the MSA-format (see :ref:`msa_formats` for details). This class directly
    inherits all methods of the :py:class:`~lingpy.align.multiple.Multiple`
    class.
    """
    def __init__(self, infile, **keywords):
        util.setdefaults(
            keywords,
            comment=rcParams['comment'],
            diacritics=rcParams['diacritics'],
            vowels=rcParams['vowels'],
            tones=rcParams['tones'],
            combiners=rcParams['combiners'],
            breaks=rcParams['breaks'],
            stress=rcParams['stress'],
            merge_vowels=rcParams['merge_vowels'],
            ids=False,
            header=True,
            normalize=True)
        # initialization checks first whether we are dealing with msa-files
        # or with other, unaligned sequence files, and starts the loading
        # procedures accordingly
        else:
""" Initialize by passing a dictionary with the relevant values. """
""" Retrieve sound-class strings from aligned IPA sequences.
        Parameters
        ----------
        model : str (default='sca')
            The sound-class model according to which the sequences shall be
            converted.
        Notes
        -----
        This method is only useful when an ``msa``-file with an already
        conducted alignment analysis was loaded.
        """
        util.setdefaults(
            keywords,
            stress=rcParams['stress'],
            diacritics=rcParams['diacritics'],
            cldf=False)
        # redefine the sequences of the Multiple class
            tokens2class(
                seq.split(' '), self.model, stress=keywords['stress'],
                diacritics=keywords['diacritics'], cldf=keywords['cldf'])
            for seq in self.seqs]
        # define the scoring dictionaries according to the methods
        ''.join(
            class2tokens(
                class_strings[i], aligned_seqs[i])).replace('-', 'X')))
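The conversion this method performs can also be reproduced with the public helpers brought in by the import block above; a minimal, self-contained sketch (recent LingPy versions resolve the model by its name string, otherwise pass ``rc('sca')``):

>>> from lingpy import *
>>> tokens = ipa2tokens('valdemar')   # tokenize an IPA string
>>> tokens2class(tokens, 'sca')       # convert the tokens to SCA sound classes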
    def output(
            self, fileformat='msa', filename=None, sorted_seqs=False,
            unique_seqs=False, **keywords):
        """
        Write data to file.
        Parameters
        ----------
        fileformat : { "psa", "msa", "msq", "html" }
            Indicate which data should be written to file. Select between:

            * "psa" -- output of all pairwise alignments in ``psa``-format,
            * "msa" -- output of the multiple alignment in ``msa``-format,
            * "msq" -- output of the multiple sequences in ``msq``-format, or
            * "html" -- output of the multiple alignment in ``html``-format.
        filename : str
            Select a specific name for the outfile; otherwise, the name of
            the infile will be taken by default.
        sorted_seqs : bool
            Indicate whether the sequences should be sorted or not (applies
            only to 'msa' and 'msq' output).
        unique_seqs : bool
            Indicate whether only unique sequences should be written to file
            or not.
        """
            fileformat='msa', filename=os.path.splitext(tmp)[0],
            sorted_seqs=sorted_seqs, unique_seqs=unique_seqs)
# create a specific format string in order to receive taxa of equal length
# start writing data to file
        else:
            pass
            self.seq_id, taxon, taxonB))
        else:
        start = False
        out.write(txf.format("CONSE") + '\t')
        out.write('# Created using LingPy\n')
        if hasattr(self, 'params'):
            out.write('# Parameters: ' + self.params + '\n')
        out.write('# Created: {0}\n'.format(rcParams['timestamp']))
""" Basic class for dealing with the pairwise alignment of sequences.
    Parameters
    ----------
    infile : file
        A file in ``psq``-format.
    merge_vowels : bool (default=True)
        Indicate whether neighboring vowels should be merged into
        diphthongs, or whether they should be kept separate during the
        analysis.
    comment : char (default='#')
        The comment character which, inserted at the beginning of a line,
        prevents that line from being read.
    Attributes
    ----------
    taxa : list
        A list of tuples containing the taxa of all sequence pairs.
    seqs : list
        A list of tuples containing all sequence pairs.
    tokens : list
        A list of tuples containing all sequence pairs in a tokenized form.
    Notes
    -----
    In order to read in data from text files, two different file formats can
    be used along with this class: the PSQ-format and the PSA-format (see
    :ref:`psa_formats` for details). This class inherits the methods of the
    :py:class:`~lingpy.align.pairwise.Pairwise` class.
    """
    def __init__(self, infile, **keywords):
        util.setdefaults(
            keywords,
            comment=rcParams['comment'],
            diacritics=rcParams['diacritics'],
            vowels=rcParams['vowels'],
            tones=rcParams['tones'],
            combiners=rcParams['combiners'],
            breaks=rcParams['breaks'],
            stress=rcParams["stress"],
            merge_vowels=rcParams['merge_vowels'],
        )
        # add the comment character
        # import the data from the input file
        # set the first parameters
        # delete the first line of the data, since it is no longer needed
        # append the other lines of the data; they consist of triplets
        # separated by double line breaks
        # check the ending of the infile
        else:
        except:

    """ Load a ``psa``-file. """
        ([text_type(a) for a in almA], [text_type(b) for b in almB], 0))

    """ Load a ``psq``-file. """
        taxonA, seqA = data[i + 1].split('\t')
        taxonB, seqB = data[i + 2].split('\t')
""" Write the results of the analyses to a text file.
Parameters ---------- fileformat : { 'psa', 'psq' } Indicate which data should be written to file. Select between:
* 'psa' -- output of all pairwise alignments in ``psa``-format, * 'psq' -- output of the multiple sequences in ``psq``-format.
filename : str Select a specific name for the outfile, otherwise, the name of the infile will be taken by default.
""" keywords, gop=-2, model=rcParams['sca'], transform=rcParams['align_transform'], scores=False)
        # define the outfile and check whether it already exists
        # if data is simple, just write simple data to file
        # determine the longest taxon name in order to create a format
        # string for taxa of equal length
        else:
            # if fileformat == 'psa':
            # determine the longest taxon name in order to create a format
            # string for taxa of equal length
# get partial alignment scores
else:
                ['{0:.2f}'.format(s) for s in scores]
            ) + '\t{0:.2f}\n'.format(sum(scores)))
""" Class handles Wordlists for the purpose of alignment analyses.
    Parameters
    ----------
    infile : str
        The name of the input file. It should conform to the basic format of
        the :py:class:`~lingpy.basic.wordlist.Wordlist` class and define a
        specific ID for cognate sets.
    row : str (default="concept")
        A string indicating the name of the row that shall be taken as the
        basis for the tabular representation of the word list.
    col : str (default="doculect")
        A string indicating the name of the column that shall be taken as
        the basis for the tabular representation of the word list.
    conf : string (default='')
        A string defining the path to the configuration file.
    ref : string (default='cogid')
        The name of the column that stores the cognate IDs.
    modify_ref : function (default=False)
        Use a function to modify the reference. If your cognate identifiers
        are numerical, for example, and negative values are assigned as
        loans, but you want to suppress this behaviour, just set this
        keyword to "abs", and all cognate IDs will be converted to their
        absolute value.
    split_on_tones : bool (default=True)
        If set to True, the algorithm will attempt, in fuzzy alignment mode,
        to split words into morphemes by tones if no explicit morpheme
        markers can be found.
    Attributes
    ----------
    msa : dict
        A dictionary storing multiple alignments as dictionaries which can
        be directly opened and aligned with help of the
        :py:func:`~lingpy.align.sca.SCA` function. The alignment objects are
        referenced by a key which is identical with the "reference" (*ref*
        keyword) of the alignment, that is, the name of the column which
        contains the cognate identifiers.
    Notes
    -----
    This class inherits from :py:class:`~lingpy.basic.wordlist.Wordlist` and
    additionally creates instances of the
    :py:class:`~lingpy.align.multiple.Multiple` class for all cognate sets
    that are specified by the *ref* keyword.
    """
    def __init__(
            self, infile, row='concept', col='doculect', conf='',
            modify_ref=False, _interactive=True, split_on_tones=True,
            ref="cogid", **keywords):
        kw = {"alignment": "alignment", "segments": "tokens", "transcription":
              "ipa", "ref": "cogid", "fuzzy": False}
        # initialize the wordlist
            self.header else self._alias[kw['alignment']]
            self._alias[kw['segments']]
            self.header else self._alias[kw['transcription']]
        # check whether fuzzy (partial) alignment or normal alignment is
        # carried out; if a new namespace is used, we assume it to be plain
            kw['fuzzy'] or (ref in self._class_string and
            self._class_string[ref] not in ['str', 'int'])) else 'plain'
        # store the loan status
            ipa2tokens)
            lambda x: ' '.join(
                [y for y in x if y not in rcParams['gap_symbol']]))
        else:
split_on_tones=split_on_tones)
    def add_alignments(self, ref=False, fuzzy=False, split_on_tones=True):
        """
        Add a new set of alignments to the data.
        Parameters
        ----------
        ref : str (default=False)
            Use this to set the name of the column which contains the
            cognate sets.
        fuzzy : bool (default=False)
            If set to True, force the algorithm to treat the cognate sets as
            fuzzy cognate sets, that is, as multiple cognate sets which are
            assigned, in order, to a word (proper "partial cognates").
        """
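A sketch of how a second, partial-cognate reference could be registered (method name as reconstructed above), assuming the hypothetical wordlist also has a ``cogids`` column with partial cognate IDs:

>>> from lingpy import *
>>> alm = Alignments('my_wordlist.tsv', ref='cogid')  # hypothetical input file
>>> alm.add_alignments(ref='cogids', fuzzy=True)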
# check for cognate-id or alignment-id in header
        # create the alignments by assembling the IDs of all sequences
        # set up the dictionary
        else:
        # set up the data
        else:
            # check for partial cognates
            # split the string into morphemes
            # FIXME add keywords for morpheme segmentation
                tones='' if not split_on_tones else 'T')
            # get the position of the morpheme
""" Function reduces alignments which contain columns that are marked to be \ ignored by the user.
        Notes
        -----
        This function changes the data only internally: all alignments are
        checked as to whether they contain data that should be ignored. If
        this is the case, the alignments are reduced and stored in a
        specific item of the alignment string. If the method doesn't find
        any instances for reduction, it still makes copies of the alignments
        in order to guarantee that the alignments with which we want to work
        are at the same place in the dictionary.
        """
            'No alignments found in your data. '
            'You should carry out an alignment analysis first!')
        # dictionary to add the new alignments to the class afterwards, for
        # providing quick access
        for k, d in self._meta['msa'][ref].items():
            ralms = reduce_alignment(d[alignment])
""" Add alignments to column (space-separated) in order to make it easy to parse them in the wordlist editor. """
        # plain mode, that is, no partial alignments
        log.error("There are no alignments in your data. Aborting...")
                tmp[m] = ' '.join(self[m, self._alignment])
            else:
                raise ValueError(
                    "There are no phonetic sequences (TOKENS, ALIGNMENT, "
                    "or IPA) in your data.")
        else:
            # in this mode, we need to trace the order of the bits that make
            # up the alignments
            for key in self:
                # get the cognate IDs
                cogids = self[key, ref]
                # get the alignment
                tmp[key] = []
                for i, cogid in enumerate(cogids):
                    if cogid in self.msa[ref]:
                        msa = self.msa[ref][cogid]
                        idx = msa['ID'].index(key)
                    else:
                        # add the morpheme separator as long as we don't add
                        # the last element
                        if i < len(cogids) - 1:
                            tmp[key] += [rcParams['morpheme_separator']]
""" Carry out a multiple alignment analysis of the data.
        Parameters
        ----------
        method : { "progressive", "library" } (default="progressive")
            Select the method to use for the analysis.
        iteration : bool (default=False)
            Set to c{True} in order to use iterative refinement methods.
        swap_check : bool (default=False)
            Set to c{True} in order to carry out a swap-check.
        model : { 'dolgo', 'sca', 'asjp' }
            A string indicating the name of the
            :py:class:`Model <lingpy.data.model>` object that shall be used
            for the analysis. Currently, three models are supported:

            * "dolgo" -- a sound-class model based on
              :evobib:`Dolgopolsky1986`,
            * "sca" -- an extension of the "dolgo" sound-class model based
              on :evobib:`List2012b`, and
            * "asjp" -- an independent sound-class model which is based on
              the sound-class model of :evobib:`Brown2008` and the empirical
              data of :evobib:`Brown2011` (see the description in
              :evobib:`List2012`).
        mode : { 'global', 'dialign' }
            A string indicating which kind of alignment analysis should be
            carried out during the progressive phase. Select between:

            * "global" -- traditional global alignment analysis based on the
              Needleman-Wunsch algorithm :evobib:`Needleman1970`,
            * "dialign" -- global alignment analysis which seeks to maximize
              local similarities :evobib:`Morgenstern1996`.
        modes : list (default=[('global', -2, 0.5), ('local', -1, 0.5)])
            Indicate the mode, the gap opening penalties (GOP), and the gap
            extension scale (GEP scale) of the pairwise alignment analyses
            which are used to create the library.
        gop : int (default=-5)
            The gap opening penalty (GOP) used in the analysis.
        scale : float (default=0.6)
            The factor by which the penalty for the extension of gaps (gap
            extension penalty, GEP) shall be decreased. This approach is
            essentially inspired by the extension of the basic alignment
            algorithm for affine gap penalties in :evobib:`Gotoh1982`.
        factor : float (default=1)
            The factor by which the initial and the descending position
            shall be modified.
        tree_calc : { 'neighbor', 'upgma' } (default='upgma')
            The cluster algorithm which shall be used for the calculation of
            the guide tree. Select between ``neighbor``, the Neighbor-Joining
            algorithm (:evobib:`Saitou1987`), and ``upgma``, the UPGMA
            algorithm (:evobib:`Sokal1958`).
        gap_weight : float (default=0)
            The factor by which gaps in aligned columns contribute to the
            calculation of the column score. When set to 0, gaps will be
            ignored in the calculation. When set to 0.5, gaps will count
            half as much as other characters.
        restricted_chars : string (default="T")
            Define which characters of the prosodic string of a sequence
            reflect its secondary structure (cf. :evobib:`List2012b`) and
            should therefore be aligned specifically. This defaults to "T",
            since this is the character that represents tones in the
            prosodic strings of sequences.
        """
        util.setdefaults(
            keywords,
            alignment=False,
            classes=rcParams['classes'],
            defaults=False,
            factor=rcParams['align_factor'],
            filename=self.filename,
            gap_weight=rcParams['gap_weight'],
            gop=rcParams['align_gop'],
            iteration=False,
            method='progressive',
            mode=rcParams['align_mode'],
            model=rcParams['sca'],
            modes=rcParams['align_modes'],
            output=False,
            plots=False,
            ref=False,
            restricted_chars=rcParams['restricted_chars'],
            scale=rcParams['align_scale'],
            scoredict=rcParams['scorer'],
            show=False,
            sonar=rcParams['sonar'],
            style='plain',
            swap_check=False,
            tree_calc=rcParams['align_tree_calc'],
        )
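A sketch of a non-default call, again assuming the hypothetical wordlist from above:

>>> from lingpy import *
>>> alm = Alignments('my_wordlist.tsv', ref='cogid')  # hypothetical input file
>>> alm.align(method='library', iteration=True, swap_check=True, model=rc('sca'))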
        # create a params attribute
            kw['method'], kw['model'].name, text_type(kw['gop']),
            '{0:.1f}'.format(kw['scale']), '{0:.1f}'.format(kw['factor']),
            kw['tree_calc'], '{0:.1f}'.format(kw['gap_weight']),
            kw['restricted_chars']])
        # check for scorer keyword
        else:
            # get the tokens
        else:
# convert back to external format, if scoredict is set
            m._sonority_consensus
            m.dataset, m.seq_id, __version__, rcParams['timestamp'], params)
""" Function creates confidence scores for a given set of alignments.
        Parameters
        ----------
        scorer : :py:class:`~lingpy.algorithm._misc.ScoreDict`
            A *ScoreDict* object which gives similarity scores for all
            segments in the alignment.
        ref : str (default="lexstatid")
            The reference entry-type, referring to the cognate set to be
            used for the analysis.
        gap_weight : float (default=1.0)
            Determine the weight assigned to matches containing gaps.
        """
""" Make an HTML plot of the aligned data. """ keywords, title='LingPy - Automatic Cognate Judgments and Alignments', shorttitle="LingPy", dataset=self.filename, show=False, filename=self.filename, ref=False, confidence=False)
            'alm', ref=keywords['ref'], filename=os.path.splitext(tmp)[0],
            confidence=keywords['confidence'])
    def get_consensus(
            self, tree=False, gaps=False, classes=False,
            consensus='consensus', counterpart='ipa', weights=[],
            return_data=False, **keywords):
        """
        Calculate a consensus string of all MSAs in the wordlist.
        Parameters
        ----------
        msa : {c{list} ~lingpy.align.multiple.Multiple}
            Either an MSA object or an MSA matrix.
        tree : {c{str} ~lingpy.thirdparty.cogent.PhyloNode}
            A tree object or a Newick string along which the consensus shall
            be calculated.
        gaps : c{bool} (default=False)
            If set to c{True}, return the gap positions in the consensus.
        classes : c{bool} (default=False)
            Specify whether sound classes shall be used to calculate the
            consensus.
        model : ~lingpy.data.model.Model
            A sound-class model according to which the IPA strings shall be
            converted to sound-class strings.
        return_data : c{bool} (default=False)
            Return the data instead of adding it in a column to the wordlist
            object.
""" keywords, model=rcParams['sca'], gap_scale=1.0, ref=rcParams['ref'], stress=rcParams['stress'], diacritics=rcParams['diacritics'], cldf=False)
# switch ref
        # reassign ref for convenience
        # check for existing alignments
            "No alignments could be found. You should carry out"
            " an alignment analysis first!")
# go on with the analysis
        # temporary solution for sound-class integration
                prosodic_string(self.msa[ref][cog]['_sonority_consensus']))
        else:
            1.0 for i in range(len(self.msa[ref][cog]['alignment']))]
                alm, keywords['model'], stress=keywords['stress'],
                cldf=keywords['cldf'], diacritics=keywords['diacritics'])
            if c != '0']
        else:
                self.msa[ref][cog]['alignment'],
                classes=_classes,
                tree=tree,
                gaps=gaps,
                taxa=[text_type(taxon.replace("(", "").replace(")", ""))
                      for taxon in self.msa[ref][cog]['taxa']],
                **keywords)
            # if there's no msa for a given cognate set, this set is a
            # singleton
            else:
                [k[0] for k in self.etd[ref][cog] if k != 0][0],
                counterpart]
            else:
                )[self[_idx, ref].index(cog)]
# add consensus to dictionary
return cons_dict
        # add the entries
        self.add_entries(
            consensus, ref, lambda x: cons_dict[x],
            override=not self._interactive)
        else:
            self.add_entries(
                consensus, ref,
                lambda x: ' + '.join(
                    [' '.join(cons_dict[y]) for y in x]))
""" Write wordlist to file.
        Parameters
        ----------
        fileformat : { "tsv", "msa", "tre", "nwk", "dst", "taxa", "starling", "paps.nex", "paps.csv", "html" }
            The format that is written to file. This corresponds to the file
            extension, thus 'tsv' creates a file in tsv-format, 'dst'
            creates a file in Phylip distance format, etc. Specific output
            is created for the formats "html" and "msa":

            * "msa" will create a folder containing all alignments of all
              cognate sets in "msa"-format
            * "html" will create html-output in which words are sorted
              according to meaning and cognate set, and all cognate words
              are aligned

        filename : str
            Specify the name of the output file (defaults to a filename that
            indicates the creation date).
        subset : bool (default=False)
            If set to c{True}, return only a subset of the data. Which
            subset is specified in the keywords 'cols' and 'rows'.
        cols : list
            If *subset* is set to c{True}, specify the columns that shall be
            written to the csv-file.
        rows : dict
            If *subset* is set to c{True}, use a dictionary consisting of
            keys that specify a column and values that give a
            Python-statement in raw text, such as, e.g., "== 'hand'". The
            content of the specified column will then be checked against the
            statement passed in the dictionary, and if it is evaluated to
            c{True}, the respective row will be written to file.
        ref : str
            Name of the column that contains the cognate IDs if 'starling'
            is chosen as an output format.
        missing : { str, int } (default=0)
            If 'paps.nex' or 'paps.csv' is chosen as fileformat, this
            character will be inserted as an indicator of missing data.
        tree_calc : { 'neighbor', 'upgma' }
            If no tree has been calculated and 'tre' or 'nwk' is chosen as
            output format, the method that is used to calculate the tree.
        threshold : float (default=0.6)
            The threshold that is used to carry out a flat cluster analysis
            if 'groups' or 'cluster' is chosen as output format.
        style : str (default="id")
            If "msa" is chosen as output format, this will write the
            alignments for each msa-file in a specific format in which the
            first column contains a direct reference to the word via its ID
            in the wordlist.
        ignore : { list, "all" }
            Modifies the output format in "tsv" output and allows to ignore
            certain blocks in extended "tsv", like "msa", "taxa", "json",
            etc., which should be passed as a list. If you choose "all" as a
            plain string and not a list, this will ignore all additional
            blocks and output only plain "tsv".
        prettify : bool (default=True)
            Inserts comment characters between concepts in the "tsv" file
            output format, which makes it easier to see blocks of words
            denoting the same concept. Switching this off will output the
            file in plain "tsv".
        See also
        --------
        ~lingpy.basic.wordlist.Wordlist.output
        ~lingpy.compare.lexstat.LexStat.output
""" ref=rcParams['ref'], filename=rcParams['filename'], style="id", defaults=False, confidence=False)
# check for html fileformat
# define two vars for convenience
# define the string to which the stuff is written
        # get a dictionary for concept IDs
        zip(self.concepts, [i + 1 for i in range(len(self.concepts))]))
        else:
            # add this line for alignments containing loans
        else:
            confs = ['{0}'.format(x)
                     for x in self.msa[ref][cogid]['confidence'][i]]
            chars = [x for x in self.msa[ref][cogid]['_charmat'][i]]
            [a + '/' + b + '/' + c for a, b, c in zip(alm, confs, chars)])
                real_cogid, taxon, concept, cid, alm_string) + '\n'
            else:
                cogid, taxon, concept, cid, ''.join(seq)) + '\n'
        util.write_text_file(
            os.path.join(
                '{0}-msa'.format(value['dataset']),
                '{0}-{1}.msa'.format(value['dataset'], key)),
            msa2str(value, wordlist=kw['style'] in ['id', 'with_id']),
            log=False)
""" Method returns alignment objects depending on input file or input data.
    Notes
    -----
    This function checks the type of the input and returns an alignment
    object of the respective type.
    """
    util.setdefaults(
        keywords,
        comment=rcParams['comment'],            # '#'
        diacritics=rcParams['diacritics'],      # None
        vowels=rcParams['vowels'],              # None
        tones=rcParams['tones'],                # None
        combiners=rcParams['combiners'],        # '\u0361\u035c'
        breaks=rcParams['breaks'],              # '.-'
        stress=rcParams['stress'],              # "ˈˌ'"
        merge_vowels=rcParams['merge_vowels'],  # True
    )
    # check for datatype
    else:
        # look up the class by file extension
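Given the extension lookup, a sketch with the test file used above:

>>> from lingpy import *
>>> msa = SCA(rc("test_path") + 'harry.msq')  # msq/msa extensions should yield an MSA object
>>> msa.prog_align()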
""" Calculate a consensus string of a given MSA.
    Parameters
    ----------
    msa : {c{list} ~lingpy.align.multiple.Multiple}
        Either an MSA object or an MSA matrix.
    gaps : c{bool} (default=False)
        If set to c{True}, return the gap positions in the consensus.
    taxa : {c{list} bool} (default=False)
        If *tree* is chosen as a parameter, specify the taxa in the order of
        the aligned strings.
    classes : c{bool} (default=False)
        Specify whether sound classes shall be used to calculate the
        consensus.
    model : ~lingpy.data.model.Model
        A sound-class model according to which the IPA strings shall be
        converted to sound-class strings.
    local : { c{bool}, "peaks", "gaps" } (default=False)
        Specify whether local pre-processing should be applied to the data.
        If set to "peaks", the average alignment score of each column is
        taken as a reference to remove low-scoring columns from the
        alignment. If set to "gaps", the columns with the highest proportion
        of gaps will be excluded.
    Returns
    -------
    cons : c{str}
        A consensus string of the given MSA.
    """
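A sketch, reusing the test file from the MSA example above:

>>> from lingpy import *
>>> msa = MSA(rc("test_path") + 'harry.msq')
>>> msa.prog_align()
>>> cons = get_consensus(msa, gaps=False)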
    util.setdefaults(
        keywords,
        model=rcParams['sca'],
        stress=rcParams['stress'],
        cldf=False,
        diacritics=rcParams['diacritics'],
        gap_scale=1.0,
        mode='majority',
        gap_score=-10,
        weights=[1 for i in range(len(msa[0]))],
        local=False)
# transform the matrix
    # custom function for tokens2class
        diacritics=keywords['diacritics'], stress=keywords['stress'])
    # check for local peaks
    # calculate a local index
            and charB not in rcParams['gap_symbol']:
                tk2k(charA), tk2k(charB)))
        else:
    # get the average, min, and max of the peaks
    # exclude those lines from the matrix whose average is smaller than pmean
    # store the number of gaps in a simple array
# we now try to get the average number of lines
    # we discard all lines which are beyond the half of the average (stupid
    # solution, but for testing it hopefully suffices...)
    # check for classes
    # if classes are passed as array, we use this array as is
    # if classes is a Model-object
    # if no tree is passed, it is a simple majority-rule principle that
    # outputs the consensus string
    # half the weight of gaps
    # if mode is set to 'maximize', calculate the score
            if '-' not in (c, c2):
            else:
                rcParams['gap_symbol']):
        else:
            keywords['weights'][i]
        tmpC.items(), key=lambda x: (x[1], tmpA[x[0]]), reverse=True)]
    # check for identical classes
    else:
    # apply the check for gaps here: if there are more gaps than in the
    # full column, take the gaps; otherwise, take the next char
        [tmp[x] for x in tmp if x != rcParams['gap_symbol']]):
        else:
        else: