Aligner
This module implements a Needleman-Wunsch+Gotoh sequence aligner.
- class Data(score: float)
Private data class for the Needleman-Wunsch+Gotoh sequence aligner.
- __init__(score: float)
- score: float
The current score.
- p: float
\(P_{m,n}\) in [Gotoh1982].
- q: float
\(Q_{m,n}\) in [Gotoh1982].
- pSize: int
The size of the p gap. \(k\) in [Gotoh1982].
- qSize: int
The size of the q gap. \(k\) in [Gotoh1982].
- class Aligner(start_score: float = -1.0, open_score: float = -1.0, extend_score: float = -0.5)
A generic Needleman-Wunsch+Gotoh sequence aligner.
This implementation uses Gotoh’s improvements to get \(\mathcal{O}(mn)\) running time and reduce memory requirements to essentially the backtracking matrix only. In Gotoh’s technique, the gap weight formula must be of the special form \(w_k = uk + v\) (affine gap), where \(k\) is the gap size, \(v\) is the gap opening score and \(u\) is the gap extension score.
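As a worked illustration of the affine gap formula \(w_k = uk + v\), the following minimal sketch evaluates it with the default scores from the class signature. The function name `gap_weight` is hypothetical and not part of this module.

```python
def gap_weight(k: int, open_score: float = -1.0, extend_score: float = -0.5) -> float:
    """Affine gap weight w_k = u*k + v for a gap of size k >= 1.

    open_score is v and extend_score is u; the defaults mirror the
    Aligner signature. This helper is illustrative only.
    """
    if k < 1:
        raise ValueError("gap size must be at least 1")
    return extend_score * k + open_score


# One gap of size 2 pays the opening score once ...
one_long_gap = gap_weight(2)        # -0.5 * 2 + -1.0 = -2.0
# ... while two separate gaps of size 1 pay it twice.
two_short_gaps = 2 * gap_weight(1)  # 2 * (-0.5 + -1.0) = -3.0
```

With negative scores, the affine form therefore favours a single longer gap over several short ones of the same total size.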
The aligner is type-agnostic and only calls the method Strategy.similarity() on the given strategy.
- __init__(start_score: float = -1.0, open_score: float = -1.0, extend_score: float = -0.5)
- start_score: float
The gap opening score at the start of the string. Set this to 0 to find local alignments.
- open_score: float
The gap opening score \(v\).
- extend_score: float
The gap extension score \(u\).
- align(strategy: super_collator.strategy.Strategy[super_collator.token.TT], tokens_a: Sequence[super_collator.token.Token[super_collator.token.TT]], tokens_b: Sequence[super_collator.token.Token[super_collator.token.TT]]) Tuple[Sequence[super_collator.token.Token[super_collator.token.TT]], float]
Align two sequences.
- Returns
the aligned sequence (of MultiTokens) and the score
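To make the scoring concrete, here is a minimal, score-only sketch of a Gotoh-style recurrence with affine gaps. It is not this module's implementation: the function `gotoh_score` and the `similarity` callable are hypothetical stand-ins, it ignores `start_score` (gaps at the start of a sequence use the ordinary opening score), and it skips backtracking, so it returns only the score and not the aligned sequence.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
NEG_INF = float("-inf")


def gotoh_score(
    a: Sequence[T],
    b: Sequence[T],
    similarity: Callable[[T, T], float],
    open_score: float = -1.0,    # v in w_k = u*k + v
    extend_score: float = -0.5,  # u in w_k = u*k + v
) -> float:
    """Return the best global alignment score of a and b (illustrative only)."""
    m, n = len(a), len(b)
    # d[j] holds the best score per column; p[j] holds the best score
    # ending in a gap that consumes items of a. Both are reused row by row.
    d = [0.0] + [open_score + extend_score * j for j in range(1, n + 1)]
    p = [NEG_INF] * (n + 1)
    for i in range(1, m + 1):
        diag = d[0]                           # score at (i-1, 0)
        d[0] = open_score + extend_score * i  # leading gap of size i
        q = NEG_INF                           # best score ending in a gap consuming items of b
        for j in range(1, n + 1):
            # Either open a new gap (pay v + u) or extend the running one (pay u).
            p[j] = max(d[j] + open_score + extend_score, p[j] + extend_score)
            q = max(d[j - 1] + open_score + extend_score, q + extend_score)
            best = max(diag + similarity(a[i - 1], b[j - 1]), p[j], q)
            diag = d[j]                       # keep (i-1, j) for the next column
            d[j] = best
    return d[n]


def equality(x: str, y: str) -> float:
    """Hypothetical similarity: +1.0 for equal items, -1.0 otherwise."""
    return 1.0 if x == y else -1.0


identical = gotoh_score("abc", "abc", equality)  # three matches
with_gap = gotoh_score("abcd", "ad", equality)   # two matches and one gap of size 2
```

Because the sketch keeps only one row of each score table, it needs linear memory; a full aligner additionally stores a backtracking matrix to recover the alignment itself.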
- build_debug_matrix(matrix: List[List[super_collator.aligner.Data]], len_matrix: List[List[int]], ts_a: Sequence[super_collator.token.Token[super_collator.token.TT]], ts_b: Sequence[super_collator.token.Token[super_collator.token.TT]]) str
Build a human-readable debug matrix.
- Parameters
matrix – the full scoring matrix
len_matrix – the backtracking matrix
ts_a – the first aligned string
ts_b – the second aligned string
- Returns
the debug matrix as a human-readable string