Coverage for lingpy/evaluate/acd.py : 100%

# *-* coding: utf-8 *-*
"""
Evaluation methods for automatic cognate detection.
"""
""" Print out the results of an analysis. """
* {0:7}-Scores * * --------------------- * * Precision: {1:.4f} * * Recall: {2:.4f} * * F-Scores: {3:.4f} * *************************'""".format( results, p, r, f)
def bcubes(lex, gold='cogid', test='lexstatid', modify_ref=False, pprint=True,
        per_concept=False):
    """
    Compute B-Cubed scores for test and reference datasets.
    Parameters
    ----------
    lex : :py:class:`lingpy.basic.wordlist.Wordlist`
        A :py:class:`lingpy.basic.wordlist.Wordlist` or a daughter class
        (like the :py:class:`~lingpy.compare.lexstat.LexStat` class used for
        the computation). It should have two columns indicating cognate IDs.
    gold : str (default='cogid')
        The name of the column containing the gold standard cognate
        assignments.
    test : str (default='lexstatid')
        The name of the column containing the automatically implemented
        cognate assignments.
    modify_ref : function (default=False)
        Use a function to modify the reference. If your cognate identifiers
        are numerical, for example, and negative values are assigned as
        loans, but you want to suppress this behaviour, just set this keyword
        to "abs", and all cognate IDs will be converted to their absolute
        value.
    pprint : bool (default=True)
        Print out the results.
    per_concept : bool (default=False)
        Compute B-Cubed scores per concept and not for the whole data in one
        piece.
    Returns
    -------
    t : tuple
        A tuple consisting of the precision, the recall, and the harmonic
        mean (F-scores).
    Notes
    -----
    B-Cubed scores were first described by :evobib:`Bagga1998` as part of an
    algorithm. Later on, :evobib:`Amigo2009` showed that they can also be
    used to compare cluster decisions. :evobib:`Hauer2011` first applied
    B-Cubed scores to the task of automatic cognate detection.
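    Examples
    --------
    A minimal sketch; the file name "harry.qlc" is hypothetical, and the
    wordlist is assumed to already carry both a gold standard "cogid" and a
    test "lexstatid" column:

    >>> from lingpy.compare.lexstat import LexStat
    >>> from lingpy.evaluate.acd import bcubes
    >>> lex = LexStat("harry.qlc")
    >>> p, r, f = bcubes(lex, gold='cogid', test='lexstatid', pprint=False)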
    See also
    --------
    diff
    pairs
    """
    # if loans are treated as homologs
    # check for linesize
    # get cognate-ids in the other set for the line
    # get the recall
    # b-cubed recall
    # b-cubed precision
    # calculate general scores
""" Compute B-Cubed scores for test and reference datasets for partial cognate\ detection.
    Parameters
    ----------
    wordlist : :py:class:`~lingpy.basic.wordlist.Wordlist`
        A :py:class:`~lingpy.basic.wordlist.Wordlist`, or one of its daughter
        classes (like, e.g., the :py:class:`~lingpy.compare.partial.Partial`
        class used for the computation of partial cognates). It should have
        two columns indicating cognate IDs.
    gold : str (default='cogid')
        The name of the column containing the gold standard cognate
        assignments.
    test : str (default='lexstatid')
        The name of the column containing the automatically implemented
        cognate assignments.
    pprint : bool (default=True)
        Print out the results.
    Returns
    -------
    t : tuple
        A tuple consisting of the precision, the recall, and the harmonic
        mean (F-scores).
    Notes
    -----
    B-Cubed scores were first described by :evobib:`Bagga1998` as part of an
    algorithm. Later on, :evobib:`Amigo2009` showed that they can also be
    used to compare cluster decisions. :evobib:`Hauer2011` first applied
    B-Cubed scores to the task of automatic cognate detection.
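    Examples
    --------
    A minimal sketch; the file name and the "cogids"/"partialids" column
    names are assumptions, standing in for the gold and test partial cognate
    sets:

    >>> from lingpy.compare.partial import Partial
    >>> from lingpy.evaluate.acd import partial_bcubes
    >>> part = Partial("partial-data.tsv")
    >>> p, r, f = partial_bcubes(part, gold='cogids', test='partialids',
    ...     pprint=False)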
    See also
    --------
    bcubes
    diff
    pairs
    """
    # here's the point with bcubes for fuzzy: if we compare, we need to make
    # sure we count whether one instance is identical, not whether all of
    # them are identical!
    # now we need to get the position in the index
def pairs(lex, gold='cogid', test='lexstatid', modify_ref=False, pprint=True,
        _return_string=False):
    """
    Compute pair scores for the evaluation of cognate detection algorithms.
    Parameters
    ----------
    lex : :py:class:`lingpy.compare.lexstat.LexStat`
        The :py:class:`~lingpy.compare.lexstat.LexStat` class used for the
        computation. It should have two columns indicating cognate IDs.
    gold : str (default='cogid')
        The name of the column containing the gold standard cognate
        assignments.
    test : str (default='lexstatid')
        The name of the column containing the automatically implemented
        cognate assignments.
    modify_ref : function (default=False)
        Use a function to modify the reference. If your cognate identifiers
        are numerical, for example, and negative values are assigned as
        loans, but you want to suppress this behaviour, just set this keyword
        to "abs", and all cognate IDs will be converted to their absolute
        value.
    pprint : bool (default=True)
        Print out the results.
    Returns
    -------
    t : tuple
        A tuple consisting of the precision, the recall, and the harmonic
        mean (F-scores).
    Notes
    -----
    Pair-scores can be computed in different ways, often with different
    results. This variant follows the description by
    :evobib:`Bouchard-Cote2013`.
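    Examples
    --------
    A minimal sketch, analogous to :py:func:`bcubes` (the file name is
    hypothetical, and the two cognate columns are assumed to exist):

    >>> from lingpy.compare.lexstat import LexStat
    >>> from lingpy.evaluate.acd import pairs
    >>> lex = LexStat("harry.qlc")
    >>> p, r, f = pairs(lex, gold='cogid', test='lexstatid', pprint=False)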
    See also
    --------
    diff
    bcubes
    """
    # if loans are treated as homologs
    # calculate precision and recall
    # print the results if this option is chosen
def diff(wordlist, gold='cogid', test='lexstatid', modify_ref=False,
        pprint=True, filename='', tofile=True, transcription="ipa",
        concepts=False):
    r"""
    Write differences in classifications on an item-basis to file.
    Parameters
    ----------
    wordlist : :py:class:`lingpy.compare.lexstat.LexStat`
        The :py:class:`~lingpy.compare.lexstat.LexStat` class used for the
        computation. It should have two columns indicating cognate IDs.
    gold : str (default='cogid')
        The name of the column containing the gold standard cognate
        assignments.
    test : str (default='lexstatid')
        The name of the column containing the automatically implemented
        cognate assignments.
    modify_ref : function (default=False)
        Use a function to modify the reference. If your cognate identifiers
        are numerical, for example, and negative values are assigned as
        loans, but you want to suppress this behaviour, just set this keyword
        to "abs", and all cognate IDs will be converted to their absolute
        value.
    pprint : bool (default=True)
        Print out the results.
    filename : str (default='')
        Name of the output file. If not specified, it is identical with the
        name of the :py:class:`~lingpy.compare.lexstat.LexStat`, but with the
        extension ``diff``.
    tofile : bool (default=True)
        If set to ``False``, no data will be written to file; instead, the
        data will be returned.
    transcription : str (default="ipa")
        The column in which the transcriptions are located (should be a
        string, not a segmentized version, for convenience of writing to
        file).
    concepts : list (default=False)
        Restrict the evaluation to one or more selected concepts.
    Returns
    -------
    t : tuple
        A nested tuple consisting of two further tuples. The first contains
        precision, recall, and harmonic mean (F-scores); the second contains
        the same values for the pair-scores.
    Notes
    -----
    If the **tofile** option is chosen, the results are written to a specific
    file with the extension ``diff``. This file contains all cognate sets in
    which there are differences between gold standard and test sets. It also
    gives detailed information regarding false positives, false negatives,
    and the words involved in these wrong decisions.
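    Examples
    --------
    A minimal sketch (hypothetical file name); with ``tofile=False`` the
    scores are returned instead of being written to a ``.diff`` file:

    >>> from lingpy.compare.lexstat import LexStat
    >>> from lingpy.evaluate.acd import diff
    >>> lex = LexStat("harry.qlc")
    >>> (bp, br, bf), (pp, pr, pf) = diff(lex, tofile=False, pprint=False)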
    See also
    --------
    bcubes
    pairs
    """
    # open file
    # concepts, allow to check scores for only one concept
    # get a formatter for language names
    # get the basic index for all seqs
    # calculate the transformation distance of the sets
    # calculate the bcubed precision for the sets
    # calculate b-cubed recall
    # calculate pair precision
    # get the words
    # get a word formatter
    # write differences to file, sorted via
    # sorted(zip(words, langs, cogsG, cogsT), key=lambda x: (x[2], x[3])),
    # each line combining lform.format(lang), wform.format(word), cG, cT
""" Calculate the n-point average precision.
    Parameters
    ----------
    scores : list
        The scores of your algorithm for pairwise string comparison.
    cognates : list
        The cognate codings of the word pairs you compared. 1 indicates that
        the pair is cognate, 0 indicates that it is not cognate.
    reverse : bool (default=False)
        The order of your ranking mechanism. If your algorithm yields high
        scores for words which are probably cognate, and low scores for
        non-cognate words, you should set this keyword to "True".
    Notes
    -----
    This follows the description in :evobib:`Kondrak2002`. The n-point
    average precision is useful to compare the discriminative force of
    different algorithms for string similarity, or to train the parameters
    of a given algorithm.
    Examples
    --------

    >>> scores = [1, 2, 3, 4, 5]
    >>> cognates = [1, 1, 1, 0, 0]
    >>> from lingpy.evaluate.acd import npoint_ap
    >>> npoint_ap(scores, cognates)
    1.0
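    If higher scores indicate likely cognates, set ``reverse=True``; the
    following line (added here for illustration) mirrors the scores above
    and again ranks all cognate pairs first:

    >>> npoint_ap([5, 4, 3, 2, 1], cognates, reverse=True)
    1.0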
""" key=lambda x: x[0], reverse=reverse)):
""" Populate a wordlist with random cognates for each entry.
    Parameters
    ----------
    ref : str (default="randomid")
        Cognate set identifier for the newly created random cognate sets.
    bias : str (default=False)
        When set to "lumper" this will tend to create fewer cognate sets and
        larger clusters; when set to "splitter" it will tend to create
        smaller clusters.
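    Examples
    --------
    A minimal sketch (hypothetical file name), scoring a random baseline
    against the gold standard:

    >>> from lingpy.basic.wordlist import Wordlist
    >>> from lingpy.evaluate.acd import random_cognates, bcubes
    >>> wl = Wordlist("harry.qlc")
    >>> random_cognates(wl, ref='randomid')
    >>> p, r, f = bcubes(wl, gold='cogid', test='randomid', pprint=False)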
    Note
    ----
    When using this method for evaluation, you should be careful not to
    over-interpret the results: the function which creates the random
    clusters is based on simple functions for randomization, and thus
    probably does not yield a very realistic baseline.
    """
"""Return extreme cognates, either lump all words together or split them.
    Parameters
    ----------
    wordlist : ~lingpy.basic.wordlist.Wordlist
        A ~lingpy.basic.wordlist.Wordlist object.
    ref : str (default="extremeid")
        The name of the column in your wordlist to which the new IDs should
        be written.
    bias : str (default="lumper")
        If set to "lumper", all words with a certain meaning will be given
        the same cognate set ID; if set to "splitter", all will be given a
        separate ID.
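    Examples
    --------
    A minimal sketch (hypothetical file name), creating a "lumper" baseline
    and scoring it against the gold standard:

    >>> from lingpy.basic.wordlist import Wordlist
    >>> from lingpy.evaluate.acd import extreme_cognates, bcubes
    >>> wl = Wordlist("harry.qlc")
    >>> extreme_cognates(wl, ref='extremeid', bias='lumper')
    >>> p, r, f = bcubes(wl, gold='cogid', test='extremeid', pprint=False)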
"""