Strategy
This module implements the token comparison strategies.
- bit_count(x: int) -> int
Count the set bits in the given integer.
Replacement for bit_count in Python < 3.10. See: [Warren2013], chapter 5, page 81ff.
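As an illustration of what such a fallback can look like, here is a minimal sketch. It is not the module's code: the name bit_count_sketch is hypothetical, and it uses the simple "clear the lowest set bit" loop rather than the technique cited from [Warren2013]. Like int.bit_count() in Python 3.10 and later, it counts the bits of the absolute value.

```python
def bit_count_sketch(x: int) -> int:
    """Count the set bits in abs(x); hypothetical stand-in for int.bit_count()."""
    x = abs(x)
    count = 0
    while x:
        x &= x - 1  # clears the lowest set bit
        count += 1
    return count


print(bit_count_sketch(0b1011))
```

The loop runs once per set bit, so it works for Python integers of any size.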
- class Strategy
Base class for all classes that implement a strategy.
When the aligner wants to compare two tokens, it calls the method similarity(). This method should return the score of the alignment. The score should increase with the desirability of the alignment, but otherwise there are no fixed rules.
The score must harmonize with the penalties for inserting gaps (these are set in the aligner). If the score for opening a gap is -1.0 (the default), then a satisfactory match should return a score > 1.0.
Subclasses will implement different algorithms, like consulting a PAM or BLOSUM matrix, or computing a Hamming distance between the input tokens.
Auxiliary input needed for the similarity calculation may be stored in user_data. For example, you can store a POS tag in user_data and write a strategy that uses the POS tag while computing the similarity.
The method preprocess() is called once on every token before the aligner starts working. The strategy may then precompute some values and store them in strategy_data. The total time spent in preprocessing is linear, \(\mathcal{O}(n+m)\), while the total time spent in alignment is quadratic, \(\mathcal{O}(nm)\), with \(n\) and \(m\) being the lengths of the two strings to be aligned.
- preprocess(a: super_collator.token.Token[super_collator.token.TT]) -> None
Preprocess a token.
This function is called once for each token to give us a chance to preprocess it into a more easily comparable form.
- similarity(a: super_collator.token.Token[super_collator.token.TT], b: super_collator.token.Token[super_collator.token.TT]) -> float
Return similarity between two tokens.
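The interplay of preprocess(), similarity(), user_data and strategy_data described above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the library's code: SketchToken is a hypothetical stand-in for super_collator.token.Token, PosAwareStrategy is a hypothetical strategy that does not inherit from the real base class, and the scores 2.0 and 1.5 are arbitrary values chosen to exceed a gap penalty of -1.0 in magnitude.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class SketchToken:
    """Hypothetical stand-in for super_collator.token.Token."""
    text: str
    user_data: Any = None      # auxiliary input from the caller, e.g. a POS tag
    strategy_data: Any = None  # scratch space filled in by preprocess()


class PosAwareStrategy:
    """Hypothetical strategy: compares case-folded text, then POS tags."""

    def preprocess(self, a: SketchToken) -> None:
        # Called once per token, so this work is linear in the input length.
        a.strategy_data = a.text.casefold()

    def similarity(self, a: SketchToken, b: SketchToken) -> float:
        # Called for every pair of tokens, so it only compares cached values.
        if a.strategy_data != b.strategy_data:
            return 0.0
        # Same word: score above 1.0, and higher still if the POS tags agree.
        return 2.0 if a.user_data == b.user_data else 1.5


strategy = PosAwareStrategy()
tokens = [
    SketchToken("Run", "VERB"),
    SketchToken("run", "VERB"),
    SketchToken("run", "NOUN"),
]
for t in tokens:
    strategy.preprocess(t)

print(strategy.similarity(tokens[0], tokens[1]))
print(strategy.similarity(tokens[1], tokens[2]))
```

Moving the case folding into preprocess() keeps the per-pair work in similarity() to a couple of comparisons, which matters because similarity() is the part that runs a quadratic number of times.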
- class StringEqualsStrategy
Calculates the similarity of two string tokens by string equality.
- similarity(a: super_collator.token.Token[super_collator.token.TT], b: super_collator.token.Token[super_collator.token.TT]) -> float
Return 1.0 if the strings are equal, 0.0 otherwise.
- class CommonNgramsStrategy(n: int = 2)
Calculates the similarity of two string tokens by common N-grams.
The similarity score is 4 times the count of common N-grams divided by the count of all N-grams.
- __init__(n: int = 2)
- similarity(a: super_collator.token.Token[super_collator.token.TT], b: super_collator.token.Token[super_collator.token.TT]) -> float
Return similarity between two tokens.
- static split_ngrams(s: str, n: int) -> List[str]
Split a string into N-grams.
- preprocess(a: super_collator.token.Token[str]) -> None
Preprocess a token.
This function is called once for each token to give us a chance to preprocess it into a more easily comparable form.
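The N-gram score described for CommonNgramsStrategy can be sketched on plain strings as follows. Several details here are assumptions, since the reference does not spell them out: N-grams are taken without padding, each string's N-grams are treated as a set, and "the count of all N-grams" is read as the sum of the two set sizes. The function names are hypothetical and are not the library's own.

```python
from typing import List, Set


def split_ngrams_sketch(s: str, n: int) -> List[str]:
    """Return every contiguous substring of length n (no padding; an assumption)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_similarity_sketch(a: str, b: str, n: int = 2) -> float:
    """4 * |common N-grams| / (|N-grams of a| + |N-grams of b|), using sets."""
    set_a: Set[str] = set(split_ngrams_sketch(a, n))
    set_b: Set[str] = set(split_ngrams_sketch(b, n))
    total = len(set_a) + len(set_b)
    if total == 0:
        # Both strings are shorter than n, so there is nothing to compare.
        return 0.0
    return 4.0 * len(set_a & set_b) / total


print(split_ngrams_sketch("night", 2))
print(ngram_similarity_sketch("night", "night"))
print(ngram_similarity_sketch("night", "nacht"))
```

Under this reading, two identical strings score 2.0, which sits above the 1.0 threshold suggested for a satisfactory match when the gap-opening score is -1.0. In the strategy itself, the splitting step would naturally belong in preprocess() so that it runs once per token rather than once per comparison.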