% This is not a module
\section{General}
\subsection{Words}
The original and primary aim of CLD is to support computational-linguistic study
of multiple languages simultaneously. We would like all languages to
be both readable and writable for all users, but we do not wish to
require users to install hundreds of input methods on their computers,
nor can we make any assumptions about which input methods are already
installed, since there is no standard set across platforms.
One option is to define {\df romanizations}, which convert
ASCII key sequences ({\df ASCII orthography} or {\df romanized text}) to Unicode characters
({\df native orthography}). Character entry
is (or may be) entirely in ASCII, but display uses (or may use) the
native script for the language.
There are at least three places where the choice of orthography comes into play:
in text input, in text display, and in data
files.
The Universal Dependencies (UD) Treebank uses standard native orthography
(Unicode strings) in data files, and our interest in conforming to
UD as an emerging standard motivates us to do likewise.
The main potential drawback is that producing romanized text for
editing involves inverting a romanization, and
romanizations are not guaranteed
to be invertible. Nonetheless, it does seem natural to
expect the romanized version of a text to exist and to be unique, so we will
require that all romanizations we use are in fact invertible.
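The invertibility requirement can be made concrete with a small check.
The following is only a minimal sketch, not CLD's implementation: it
assumes a romanization is given as a table from ASCII key sequences to
single native characters and is applied by greedy longest match. The
names {\tt to\_native}, {\tt to\_ascii}, and {\tt round\_trips}, and
the Greek toy tables, are hypothetical.

```python
from typing import Dict


def to_native(text: str, table: Dict[str, str]) -> str:
    """Apply a romanization: ASCII key sequences -> native characters.

    Assumption: greedy longest match; unmapped characters pass through.
    """
    keys = sorted(table, key=len, reverse=True)  # try longest keys first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(table[k])
                i += len(k)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


def to_ascii(text: str, table: Dict[str, str]) -> str:
    """Candidate inverse: native characters -> ASCII key sequences.

    Assumption: every table value is a single, distinct native character.
    """
    inverse = {native: keys for keys, native in table.items()}
    return "".join(inverse.get(ch, ch) for ch in text)


def round_trips(native: str, table: Dict[str, str]) -> bool:
    """True if the romanized form of `native` converts back to `native`."""
    return to_native(to_ascii(native, table), table) == native


# Hypothetical toy tables. In `bad`, the keys "t" + "h" collide with "th".
bad = {"a": "α", "t": "τ", "h": "η", "th": "θ"}
good = {"a": "α", "t": "τ", "e": "η", "th": "θ"}
```

With the {\tt bad} table, the native string for {\tt t} followed by
{\tt h} has no romanized form that converts back to it, so that table
would be rejected under the invertibility requirement.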
The UD documentation discusses {\df tokenization}, which converts
running {\df plaintext} to {\df tokenized text}. Plaintext is
something of an artificial construct, however. The more general
problem is conversion of arbitrary {\df text documents}
to tokenized text, a more difficult problem, and one that in our model
is part of {\df document importing}.
Our approach within CLD is to accommodate those reading
texts, but to require somewhat more of those entering them.
Entering text is a more specialized action than reading it, so
expecting editors to have more specialized training
seems acceptable.
In particular, we expect texts to be entered in tokenized form, with
word boundaries explicitly indicated.
The main point of text entry is the set of text boxes in the
{\tt PlainTextPanel}. We would like to give users a choice of whether
to use ASCII orthography or native orthography, but in either case,
we will assume that word boundaries are unambiguously marked. When
entering Chinese, for example, spaces will be required between words,
even though that is not standard orthography.
The model we adopt for input and internal representation is the
following. Text consists of a sequence of {\df words}. Punctuation
marks are not treated as separate words; rather, each word has
associated {\df leading punctuation} and {\df trailing punctuation}.
We will also allow the possibility of distinguishing otherwise
identical forms by the addition of a {\df sense number}.
The algorithm for taking input text provided by the user and
converting it to tokenized text is as follows:
\begin{itemize}
\item Split the text at whitespace. Each resulting unit contributes
one word.
\item If the text is romanized, use the romanization to convert it to
native (Unicode) orthography.
\item A {\it word character\/} is defined to be one whose Unicode
category is letter (L), mark (M), number (N), or symbol (S).
Non-word characters have category punctuation (P), control (C), or
space (Z). Characters before the first word character are leading
punctuation and characters after the last word character are
trailing punctuation.
\item If the remaining portion ends with a period followed by one or
more digits, that is further stripped off as representing the
{\df sense number}.
\item What remains is the word. It is an error if it is empty.
\end{itemize}
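The steps above can be sketched in Python as follows. This is an
illustration under stated assumptions rather than CLD's actual code:
the names {\tt Token}, {\tt is\_word\_char}, and {\tt tokenize} are
hypothetical, the romanization is passed in as an arbitrary
string-to-string function, and ``digits'' is taken to mean the ASCII
digits 0 through 9.

```python
import re
import unicodedata
from typing import Callable, List, NamedTuple, Optional


class Token(NamedTuple):
    leading: str          # leading punctuation
    word: str             # the word, in native orthography
    sense: Optional[int]  # sense number, or None if absent
    trailing: str         # trailing punctuation


def is_word_char(ch: str) -> bool:
    # Word characters: Unicode category letter, mark, number, or symbol.
    return unicodedata.category(ch)[0] in "LMNS"


def tokenize(text: str,
             romanization: Optional[Callable[[str], str]] = None) -> List[Token]:
    tokens = []
    for unit in text.split():            # split at whitespace
        if romanization is not None:     # romanized input -> native orthography
            unit = romanization(unit)
        positions = [i for i, ch in enumerate(unit) if is_word_char(ch)]
        if not positions:                # nothing left for the word: an error
            raise ValueError(f"no word characters in {unit!r}")
        first, last = positions[0], positions[-1]
        leading, core, trailing = unit[:first], unit[first:last + 1], unit[last + 1:]
        sense = None
        m = re.search(r"\.([0-9]+)$", core)  # period followed by digits
        if m:
            sense = int(m.group(1))
            core = core[:m.start()]
        tokens.append(Token(leading, core, sense, trailing))
    return tokens
```

Because a period has category P and a digit has category N, a unit
such as {\tt world.2!} keeps the period inside the stripped portion,
and the sketch then splits it into the word {\tt world}, sense number
2, and trailing punctuation {\tt !}.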
That is the procedure for processing text coming from the user and
storing it in a tokenized-text file. When the user edits existing
text (rather than entering new text), it must be converted from
tokenized format back to the plaintext format that is presented to the
user in a text box. For that purpose, we must be able to invert the
romanization.
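The reverse direction might look like the following sketch. It assumes
a token is a tuple of leading punctuation, word, optional sense number,
and trailing punctuation, and that the inverse romanization is supplied
as a string-to-string function applied to the whole unit; the names
{\tt render\_token} and {\tt to\_plaintext} are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

# (leading punctuation, word, sense number or None, trailing punctuation)
Token = Tuple[str, str, Optional[int], str]


def render_token(token: Token,
                 to_ascii: Optional[Callable[[str], str]] = None) -> str:
    """Rebuild the whitespace-delimited unit that a token came from."""
    leading, word, sense, trailing = token
    unit = leading + word
    if sense is not None:
        unit += f".{sense}"      # sense number is written as ".N" after the word
    unit += trailing
    # If the user edits in ASCII orthography, apply the inverse romanization.
    return to_ascii(unit) if to_ascii is not None else unit


def to_plaintext(tokens: Iterable[Token],
                 to_ascii: Optional[Callable[[str], str]] = None) -> str:
    """Join the rendered units with single spaces for display in a text box."""
    return " ".join(render_token(t, to_ascii) for t in tokens)
```

Joining with single spaces means the original whitespace is not
recovered exactly, which is harmless under the model above because
whitespace only marks word boundaries.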
\subsection{Multiple orthographies}
There may be multiple orthographies in use for a given language; for
example, Ojibwe may be written either with Latin letters or with
Canadian Syllabics. Within the Latin orthography, there are
systematic variations. For example, Kimewon uses {\it N\/} where
the standard orthography uses {\it nh,\/} and he writes {\it k, t, p\/}
word-finally, even in cases where the standard orthography has {\it g, d, b.\/}
There are also dialectal differences, particularly on the question of
whether to write vowels that are silent due to syncope.
Sporadic spelling differences can be treated simply as variant forms.
The lexicon includes a facility for indicating the canonical form for any given
variant form and conversely getting the list of variant forms for any
given canonical form. That approach becomes onerous for systematic
spelling differences. We cannot simply use different romanizations,
however, because spelling variations are not generally invertible.
For example, it is not possible to (unambiguously) invert Kimewon's
orthographized word-final devoicing.
The approach we take is to adopt a single {\df canonical orthography}
but multiple {\df display orthographies}.
We require those who enter text to cooperate in using a common orthography. One
can then define a {\df respeller} to be a function that maps text in
the canonical orthography to a given display orthography. (Having a
single canonical orthography makes the construction of respellers manageable.)
Unlike romanizations, respellers do not need to be bidirectional.
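As an illustration, a respeller for the two Kimewon conventions
mentioned earlier could be sketched as below. The rules are a
deliberate simplification for the sake of the example, not a faithful
account of that orthography, and the name {\tt respell\_word} is
hypothetical. The input strings are used purely as examples.

```python
# Word-final consonants that the display orthography writes as voiceless.
FINAL_DEVOICE = {"g": "k", "d": "t", "b": "p"}


def respell_word(word: str) -> str:
    """Map one word from canonical orthography to the display orthography.

    Simplified rules (assumptions for illustration only):
      1. canonical "nh" is displayed as "N";
      2. word-final g, d, b are displayed as k, t, p.
    """
    word = word.replace("nh", "N")
    if word and word[-1] in FINAL_DEVOICE:
        word = word[:-1] + FINAL_DEVOICE[word[-1]]
    return word
```

Under these rules {\tt giizhig} and {\tt giizhik} both come out as
{\tt giizhik}, so the mapping cannot be undone. That is acceptable for
a respeller, which is only ever applied in the direction from canonical
to display orthography.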
On the input side, we provide only two
options: direct entry of canonical orthography, or the use of a
default romanization, of which only one is defined for each language.
That is, we require those who enter text to
cooperate on a common romanization for the language.
\subsection{Access points}
The basic unit of representation for tokenized text is the
{\tt TokenBlock}, which represents a translation unit (sentence or
paragraph) in tokenized format.