selkie.pretts
— Converting text to bare words
A “bare words” representation is intended to be a natural data format for the boundary between the language system properly speaking and the “media” system, meaning in particular speech versus writing. A bare words representation is intended to be neutral between spoken and written language. It consists only of words with no orthographic or typographic modifications: no punctuation, capitalization, abbreviations, arabic numbers, and so on. It is the kind of representation that is typical output of a speech recognizer, and it is suitable as input to a speech synthesizer. It is also particularly suitable for lexical lookup and parsing.
One obtains bare words from conventional text by tokenization and normalization. Selkie.pretts provides the normalization step.
- class Normalizer(text)
The text is represented as a single string. The Normalizer instance behaves as an iteration containing normalized tokens.
The normalizer uses a table of abbreviations, which is selkie/abbreviations in the selkie.data directory.