selkie.tok
— Tokenizer for Latin scripts
Usage
- tokenized(s)
The main function is
tokenized()
, which takes a text (string) and returns a list of tokens.>>> from selkie.tok import tokenized >>> t = tokenized('"Hi, @#!"\nsaid 42-J.\n') >>> t[:4] [<word '"Hi,'>, <punct '@#!"'>, <word 'said'>, <number '42'>]
The return value is a list of Token instances.
- class selkie.tok.Token
- type()
The value is one of:
'word'
,'number'
,'hyphen'
, or'punct'
. (Internally, the tokenizer also creates'space'
and'newline'
tokens, but they are not returned. Newlines are implicit in the line numbers, and spaces can be reconstructed from the column numbers.)
- string()
The original characters.
- line()
The line number (from 1).
- column()
The column number (from 0).
- endcolumn()
The ending column number (exclusive).
- start()
The character offset (from 0, inclusive).
- end()
Ending character offset (exclusive).
For example:
>>> t[5].string() 'J.' >>> t[5].line() 2 >>> t[5].column() 9
One can compute the space between two adjacent tokens by subtracting the endcolumn of the first from the column of the second:
>>> t[5].column() - t[4].endcolumn() 0 >>> t[2].string() 'said' >>> t[3].column() - t[2].endcolumn() 2
This is legitimate only if the two tokens are on the same line.
Algorithm
To be robust to OCR errors, tokens typically mix alphanumeric and punctuation characters. The only exception is that a non-peripheral hyphen will break a token into two pieces. The definition in detail is as follows:
Whitespace is a sequence of one or more characters that satisfy
isspace()
.A hyphen is a sequence of one or more hyphen characters. It is peripheral if it is preceded or followed by whitespace or by the beginning or end of the text. It is embedded if it is not peripheral.
A separator is whitespace or an embedded hyphen. Note that a peripheral hyphen is not a separator, and will be included as part of another token.
A regular token is a maximal sequence of non-separators. Its type is
word
if it contains any letters,number
if it contains digits but no letters, andpunct
if it contains neither letters nor digits.A token is either a regular token or an embedded hyphen.
The tokenizer returns a sequence of tokens. Note that whitespace is discarded. However, the tokenizer does keep track of line numbers; each token has line number as an attribute.