Coverage for lingpy/evaluate/apa.py : 93%

# -*- coding: utf-8 -*-
"""
Basic module for the comparison of automatic phonetic alignments.
"""
"""Base class for evaluation objects."""
""" Base class for the evaluation of automatic multiple sequence analyses.
Parameters
----------
gold, test : :py:class:`~lingpy.align.sca.MSA`
    The :py:class:`~lingpy.compare.Multiple` objects which shall be
    compared. The first object should be the gold standard and the
    second object should be the test set.

Notes
-----
Most of the scores that can be calculated with the help of this class
are standard evaluation scores in evolutionary biology. For a detailed
description of how these scores are calculated, see, for example,
:evobib:`Thompson1999`, :evobib:`List2012`, and :evobib:`Rosenberg2009b`.

See also
--------
~lingpy.evaluate.apa.EvalPSA
"""

def c_scores(self):
    """
    Calculate the c-scores.
    """
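# A minimal usage sketch (hedged): the input file names are
# hypothetical, and the method calls follow the docstrings below.
from lingpy import MSA
from lingpy.evaluate.apa import EvalMSA

gold = MSA('gold.msa')    # hypothetical gold-standard alignment
test = MSA('test.msa')    # hypothetical test alignment
evl = EvalMSA(gold, test)
print(evl.c_score(1), evl.sp_score(1), evl.jc_score())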
r""" Calculate the column (C) score.
Parameters ----------
mode : { 1, 2, 3, 4 } Indicate, which mode to compute. Select between:
1. divide the number of common columns in reference and test alignment by the total number of columns in the test alignment (the traditional C score described in :evobib:`Thompson1999`, also known as "precision" score in applications of information retrieval),
2. divide the number of common columns in reference and test alignment by the total number of columns in the reference alignment (also known as "recall" score in applications of information retrieval),
3. divide the number of common columns in reference and test alignment by the average number of columns in reference and test alignment, or
4. combine the scores of mode ``1`` and mode ``2`` by computing their F-score, using the formula :math:`2 \cdot \frac{p \cdot r}{p + r}`, where *p* is the precision (mode ``1``) and *r* is the recall (mode ``2``).
Returns
-------
score : float
    The C score for reference and test alignments.

Notes
-----
The different C scores described above all rest on the number of
columns shared by reference and test alignment; they differ only in how
this number is normalized. A worked example follows this docstring.

See also
--------
~lingpy.evaluate.apa.EvalPSA.c_score
"""
""" Compute the rows (R) score.
Returns -------
score : float The PIR score.
Notes ----- The R score is the number of identical rows (sequences) in reference and test alignment divided by the total number of rows.
See also -------- ~lingpy.evaluate.apa.EvalPSA.r_score """ ''.join(self.gold.alm_matrix[i]) == ''.join(self.test.alm_matrix[i])]
""" Calculate the sum-of-pairs (SP) score.
Parameters ----------
mode : { 1, 2, 3 } Indicate, which mode to compute. Select between:
1. divide the number of common residue pairs in reference and test alignment by the total number of residue pairs in the test alignment (the traditional SP score described in :evobib:`Thompson1999`, also known as "precision" score in applications of information retrieval),
2. divide the number of common residue pairs in reference and test alignment by the total number of residue pairs in the reference alignment (also known as "recall" score in applications of information retrieval),
3. divide the number of common residue pairs in reference and test alignment by the average number of residue pairs in reference and test alignment.
Returns
-------
score : float
    The SP score for gold standard and test alignments.

Notes
-----
The SP score (see :evobib:`Thompson1999`) is calculated by dividing the
number of identical residue pairs in reference and test alignment by
the total number of residue pairs in the reference alignment. A worked
example follows this docstring.

See also
--------
~lingpy.evaluate.apa.EvalPSA.sp_score
"""
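# The same normalizations, applied to residue pairs (hedged; the
# counts are invented): 120 pairs shared, 150 pairs in the test
# alignment, 160 pairs in the reference alignment:
common, in_test, in_ref = 120, 150, 160
sp_mode1 = common / in_test                     # 0.80 ("precision")
sp_mode2 = common / in_ref                      # 0.75 ("recall")
sp_mode3 = common / ((in_test + in_ref) / 2)    # ~0.77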
""" Calculate the Jaccard (JC) score.
Returns ------- score : float The JC score.
Notes ----- The Jaccard score (see :evobib:`List2012`) is calculated by dividing the size of the intersection of residue pairs in reference and test alignment by the size of the union of residue pairs in reference and test alignment.
See also -------- lingpy.test.evaluate.EvalPSA.jc_score
"""
""" Calculate msa alignment scores by calculating the pairwise scores. """
# replace all characters by numbers
# select between a calculation based on explicit weighting and one
# based on implicit weighting: explicit weighting is done by choosing
# a specific sound-class model and clustering all sequences which are
# identical; implicit weighting is done otherwise, i.e. identical
# (pid = 100) sequences are clustered into one sequence in order to
# avoid getting good scores when there are too many highly identical
# sequences.
# XXX this part of the calculation has never really been tested. I
# leave it untouched for the moment, since it won't be activated
# anyway, but we should come back to this and either follow up the
# idea or discard the application of weights XXX
if weights:  # guard reconstructed; the original condition was not preserved
    self.gold._set_model(weights)
else:
# change residues by assigning each residue a unique status in both MSAs
# start computation by assigning the variables
# start iteration
else:
    w = 0.0
# speed up the computation when sequences are identical
else:
    if [x for x in gold if x != (0, 0)] == \
            [y for y in test if y != (0, 0)]:
        pip += 1 * w
crp += len([x for x in gold if x in test and 0 not in x]) * w   # common residue pairs
trp += len([x for x in test if 0 not in x]) * w                 # pairs in the test alignment
rrp += len([x for x in gold if 0 not in x]) * w                 # pairs in the reference alignment
urp += len(set([x for x in gold + test if 0 not in x])) * w     # union of residue pairs
gcrp += len([x for x in gold if x in test]) * w                 # common pairs, gaps included
gtrp += testL * w                                               # test pairs, gaps included
grrp += goldL * w                                               # reference pairs, gaps included
# calculate the scores
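# A hedged sketch of how the counts collected above plausibly turn
# into scores; the normalizations follow the mode descriptions in the
# docstrings, but the exact original statements are not shown here.
precision = crp / trp       # common pairs / test pairs (SP mode 1)
recall = crp / rrp          # common pairs / reference pairs (SP mode 2)
fscore = 2 * (precision * recall) / (precision + recall)
jaccard = crp / urp         # intersection / union (JC score)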
""" Check for possibly identical swapped sites.
Returns -------
swap : { -2, -1, 0, 1, 2 } Information regarding the identity of swap decisions is coded by integers, whereas
1 -- indicates that swaps are detected in both gold standard and testset, whereas a negative value indicates that the positions are not identical,
2 -- indicates that swap decisions are not identical in gold standard and testset, whereas a negative value indicates that there is a false positive in the testset, and
0 -- indicates that there are no swaps in the gold standard and the testset. """
if swA == swB:
    return 1
# swA != swB:
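# A hedged sketch of the remaining branches, consistent with the
# return codes documented above (swA and swB are assumed to hold the
# swap positions detected in gold standard and test set):
elif swA and swB:
    return -1   # swaps in both, but at non-identical positions
elif swA:
    return 2    # swap decision only in the gold standard
elif swB:
    return -2   # false positive swap in the test set
return 0        # no swaps in either alignment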
""" Base class for the evaluation of automatic pairwise sequence analyses.
Parameters ----------
gold, test : :py:class:`lingpy.align.sca.PSA` The :py:class:`Pairwise <lingpy.compare.Pairwise>` objects which shall be compared. The first object should be the gold standard and the second object should be the test set.
Notes -----
Moste of the scores which can be calculated with help of this class are standard evaluation scores in evolutionary biology. For a close description on how these scores are calculated, see, for example, :evobib:`Thompson1999`, :evobib:`List2012`, and :evobib:`Rosenberg2009b`.
See also -------- ~lingpy.evaluate.apa.EvalMSA """ """ Compute the percentage of identical rows (PIR) score.
Parameters
----------
mode : { 1, 2 }
    Select between mode ``1``, where all sequences are compared with
    each other, and mode ``2``, where only whole alignments are
    compared.

Returns
-------
score : float
    The PIR score.

Notes
-----
The PIR score is the number of identical rows (sequences) in reference
and test alignment divided by the total number of rows. A short
numeric illustration follows this docstring.

See also
--------
~lingpy.evaluate.apa.EvalMSA.r_score
"""
# half point for each matched item
elif mode == 2:
    if tmp == 2:
        # mode 2: no half points!
        score += 1.0
def pairwise_column_scores(self):
    """
    Compute the different column scores for pairwise alignments.

    The method returns the precision, the recall score, and the
    f-score, following the proposal of Bergsma and Kondrak (2007), and
    the column score proposed by Thompson et al. (1999).
    """
    # the variables which store the different counts
# replace all residues in reference and test alignment with ids
# calculate the number of residues in crp, rrp, and trp
# fill in list with exact scores
elif nogaps == 0 and commons == 0:
    ...
else:
# calculate the scores
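# A hedged sketch of the final computation, following the definitions
# in the docstring of pairwise_column_scores (crp, trp, and rrp are
# the counts collected above; the original statements are not shown).
precision = crp / trp   # common columns / columns in the test alignment
recall = crp / rrp      # common columns / columns in the reference alignment
fscore = 2 * (precision * recall) / (precision + recall)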
""" Calculate column (C) score.
Returns ------- score : float The C score for reference and test alignments.
Notes ----- The C score, as it is described in :evobib:`Thompson1999`, is calculated by dividing the number of columns which are identical in the gold standarad and the test alignment by the total number of columns in the test alignment.
See also -------- ~lingpy.evaluate.EvalMSA.c_score
"""
""" Calculate the sum-of-pairs (SP) score.
Returns -------
score : float The SP score for reference and test alignments.
Notes -----
The SP score (see :evobib:`Thompson1999`) is calculated by dividing the number of identical residue pairs in reference and test alignment by the total number of residue pairs in the reference alignment.
See also -------- ~lingpy.evaluate.EvalMSA.sp_score
"""
""" Calculate the Jaccard (JC) score.
Returns ------- score : float The JC score.
Notes -----
The Jaccard score (see :evobib:`List2012`) is calculated by dividing the size of the intersection of residue pairs in reference and test alignment by the size of the union of residue pairs in reference and test alignment.
See also -------- ~lingpy.evaluate.EvalMSA.jc_score
"""
""" Write all differences between two sets to a file.
Parameters ----------
filename : str (default='eval_psa_diff') Default
"""
    seq_id,
    taxA,
    '\t'.join(g1),
    taxB,
    '\t'.join(g2),
    '{0}\t{1}'.format(taxlen * ' ', '\t'.join(['==' for x in range(maxL)])),
    '\t'.join(t1),
    '\t'.join(t2),
))
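# A minimal usage sketch for EvalPSA (hedged): the input file names
# are hypothetical, and the calls follow the docstrings above.
from lingpy.align.sca import PSA
from lingpy.evaluate.apa import EvalPSA

gold, test = PSA('gold.psa'), PSA('test.psa')
evl = EvalPSA(gold, test)
print(evl.c_score(), evl.sp_score(), evl.jc_score())
evl.diff(filename='eval_psa_diff')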