Other lexica and grammars

Census

The module seal.data.census provides an interface to a list of names from the U.S. Census. A sample of about 6,000,000 census entries was selected, and the distribution of first and last names was computed. The actual sample of last names included only valid last-name entries, so it was somewhat smaller than the original sample of entries. The first-name sample was divided by gender, so that the male sample and female sample contain about 3,000,000 sample points each.

The basic function is get().

>>> from seal.data import census
>>> francis = census.get('Francis')
>>> francis
<Name FRANCIS mr=127 fr=393 lr=385>

The argument to get() is case-insensitive. If the argument is not found in the database, the return value is None.

>>> francis.male.freq
0.16
>>> francis.female.freq
0.039
>>> francis.last.freq
0.029
>>> francis.maleness()
0.8040201005025125

The value of francis.male, francis.female, or francis.last is a census.Entry object with attributes string, freq, cumfreq, and rank.
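
The maleness() value above is consistent with taking the male frequency as a proportion of the combined male and female frequencies. This is an inference from the example figures, not a documented formula:

```python
# maleness as relative male frequency (inferred from the figures above,
# not from the seal source)
def maleness(male_freq, female_freq):
    return male_freq / (male_freq + female_freq)

maleness(0.16, 0.039)   # reproduces (to floating point) the Francis value
```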

A name that occurs in any of the samples has values for all three entries. For example, "Morrison" occurs only as a last name, yet its male and female entries are present, with zero frequency:

>>> n = census.get('Morrison')
>>> n.male.freq
0.0
>>> n.female.freq
0.0
>>> n.last.freq
0.048

One can iterate over all names by calling the function names().

WordNet

WordNet is a lexical database of word senses. A word is represented simply as a string. Each word has one or more senses:

>>> hound = word_senses('hound')
>>> hound
[Synset('hound.n.01'), Synset('cad.n.01'), Synset('hound.v.01')]

In WordNet, senses are called synsets. They are thought of as sets of synonyms, which is to say, words expressing a common sense. We can also go from senses to the words that express them:

>>> words_expressing(hound[0])
['hound', 'hound_dog']
>>> words_expressing(hound[1])
['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
>>> words_expressing(hound[2])
['hound', 'hunt', 'trace']

The senses of a word are implicitly grouped by part of speech. In our example, the noun senses of "hound" come first, and then the verb sense, and this is true in general in WordNet. One can get the group for a particular part of speech by providing the part of speech as a second argument to word_senses():

>>> word_senses('hound', 'v')
[Synset('hound.v.01')]

Senses never cross parts of speech; a given sense belongs to a single part of speech. For example, even if "christening" names the act of christening, its sense is taken to be distinct from that of the verb "christen." (To be precise, WordNet treats "christening" as ambiguous between a noun and a gerund, the latter having part of speech v and being synonymous with "christen.")

By convention, senses are numbered within part-of-speech groups, beginning at 1. The two nominal senses of "hound" are hound.n.01 and hound.n.02. One can use sense() to retrieve the synset corresponding to a designator, or sense name:

>>> sense('hound.n.01')
Synset('hound.n.01')
>>> sense('hound.n.02')
Synset('cad.n.01')

Each sense name identifies a unique synset, but there may be more than one name for a given synset, and the default name for a synset may differ from the one used to access it, as in the case of hound.n.02. In fact, there are as many names for a synset as there are words expressing the sense:

>>> words_expressing(hound[1])
['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
>>> sense_names(hound[1])
['cad.n.01', 'bounder.n.01', 'blackguard.n.01', 'dog.n.04', 'hound.n.02', 'heel.n.03']

One can confirm that the second nominal sense of "hound" and the fourth nominal sense of "dog" are one and the same:

>>> dog = word_senses('dog', 'n')
>>> dog[3] == hound[1]
True
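
The relationship between words, synsets, and sense names can be made concrete with a toy model. The synset ids and groupings below are invented for illustration; this is not the seal or WordNet API, only the numbering convention it follows:

```python
# Toy model: each word has an ordered list of senses, and a sense name
# numbers a synset within the word's senses of the same part of speech.
LEXICON = {                       # each word's senses, in order
    'hound': ['S1', 'S2', 'S3'],  # two noun senses, then a verb sense
    'cad':   ['S2'],
    'hunt':  ['S3'],
}
POS = {'S1': 'n', 'S2': 'n', 'S3': 'v'}

def sense_name(word, synset):
    # number the synset within the word's same-POS senses, starting at 1
    same_pos = [s for s in LEXICON[word] if POS[s] == POS[synset]]
    return '%s.%s.%02d' % (word, POS[synset], same_pos.index(synset) + 1)

def sense_names(synset):
    # one name per word expressing the sense
    return [sense_name(word, synset)
            for word, senses in LEXICON.items() if synset in senses]
```

On this model, sense_name('hound', 'S2') is 'hound.n.02' while sense_name('cad', 'S2') is 'cad.n.01': the same synset, two names.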

Words expressing a sense are sorted in approximate order of decreasing frequency. The default name for a sense uses the most-common word that expresses it. However, there is no guarantee that the sense is the first sense of the word. The word "pointer" provides an example:

>>> pointer = word_senses('pointer')
>>> pointer
[Synset('arrow.n.01'), Synset('pointer.n.02'), Synset('cursor.n.01'), Synset('pointer.n.04')]
>>> words_expressing(pointer[1])
['pointer']

The second sense of "pointer" can only be expressed as "pointer," so necessarily the word chosen for the default sense name is "pointer." But it is not the dominant sense of the word "pointer."

The name "synset" suggests that a sense is uniquely determined by the set of words that express it, but that is not actually the case. For example, the two senses of "otter" are identical as sets of words, but they are nonetheless distinct senses:

>>> otter = word_senses('otter')
>>> words_expressing(otter[0])
['otter']
>>> words_expressing(otter[1])
['otter']

Their names do differ: the first is otter.n.01 and the second is otter.n.02. WordNet also provides definitions for senses, and in a case like this, the definition is the easiest way to determine the intended distinction.

>>> otter[0].definition
'the fur of an otter'
>>> otter[1].definition
'freshwater carnivorous mammal having webbed and clawed feet and dark brown fur'

Incidentally, the official unique identifier for a synset is not the sense name, but its byte offset in the file for the given part of speech. For example, the first sense of "otter" occurs at offset 14765785 of the noun file, and the second sense occurs at offset 2444819 of the same file:

>>> otter[0].pos
'n'
>>> otter[0].offset
14765785
>>> otter[1].pos
'n'
>>> otter[1].offset
2444819

WordNet defines a number of relations among word senses. The most common is the "is-a" relation. The parents of a sense are called its hypernyms(), and its children are called hyponyms(). For example:

>>> otter[0].hypernyms()
[Synset('fur.n.01')]
>>> otter[0].hyponyms()
[]
>>> otter[1].hypernyms()
[Synset('musteline_mammal.n.01')]
>>> otter[1].hyponyms()
[Synset('river_otter.n.01'), Synset('eurasian_otter.n.01')]

Most senses have a unique parent, though multiple parentage occasionally occurs.
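
Because of multiple parentage, collecting all ancestors of a sense requires a graph walk rather than a simple chain. A toy sketch over hypernym links drawn from the otter example (the link table is illustrative, not loaded from WordNet):

```python
# Toy is-a hierarchy following the otter example above.
HYPERNYMS = {
    'river_otter.n.01': ['otter.n.02'],
    'eurasian_otter.n.01': ['otter.n.02'],
    'otter.n.02': ['musteline_mammal.n.01'],
    'musteline_mammal.n.01': [],
}

def ancestors(sense):
    # all senses reachable via hypernym links; handles multiple parents
    seen = []
    stack = list(HYPERNYMS.get(sense, []))
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.append(s)
            stack.extend(HYPERNYMS.get(s, []))
    return seen
```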

English Grammar

First grammars

Grammars 5, 6, and 7 form a sequence of grammars, each covering additional phenomena. In each case there are three files: for grammar 5, for example, ex.g5, ex.lex5, and ex.text5.

Grammar 5 adds pronouns and names, noun modification, a richer set of subcategorization frames (including complements of adjectives), and subordinate clauses.

Root -> S;
Root -> NP;
S -> NP[n:$n] VP[f:$n];
NP[n:$n] -> Pron[n:$n];
NP[n:$n] -> Name[n:$n];
NP[n:$n] -> Det[n:$n] Nom[n:$n];
NP[n:pl] -> NP Conj NP;
Nom[n:$n] -> Adj1 Nom[n:$n];
Nom[n:$n] -> N[n:$n];
VP[f:$f] -> V[f:$f,t:n,s:null];
VP[f:$f] -> V[f:$f,t:y,s:null] NP;
VP[f:$f] -> V[f:$f,t:y,s:$p] NP PP[f:$p];
VP[f:$f] -> V[f:$f,t:y,s:np] NP NP;
VP[f:$f] -> V[f:$f,t:n,s:$p] PP[f:$p];
VP[f:$f] -> V[f:$f,t:n,s:adj] AdjP;
VP[f:$f] -> V[f:$f,t:y,s:$c] NP SC[f:$c];
VP[f:$f] -> V[f:$f,t:n,s:$c] SC[f:$c];
PP[f:$p] -> P[f:$p] NP;
AdjP -> Adj1[s:null];
AdjP -> Adj1[s:$p] PP[f:$p];
Adj1[s:$p] -> Deg Adj[s:$p];
Adj1[s:$p] -> Adj[s:$p];
SC[f:$c] -> C[f:$c] S;
SC[f:inf] -> P[f:to] VP[f:base];

Here are examples of coverage (see ex.text5):

this dog
*this dogs
these dogs
the dog
the dogs
this dog barks
these dogs bark
*these dogs barks
*these dogs bark the cat
these dogs chase the cat
the black dog
Fido chases the cat
he thinks about the dog
she thinks that the dog chases the cat
*she thinks the dog
*she tells that the dog chases the cat
she tells the dog that the cat barks
the cat thinks
the cat wants to bark
Fido is black
the cat is happy about the toy
we gave the dog a toy
we gave a toy to the dog

Grammar 6 adds only one rule to grammar 5:

VP[f:$f] -> Aux[f:$f,t:n,s:$v] VP[f:$v];

This provides coverage of auxiliary verb sequences in English. Here are examples (ex.text6):

Fido chases Spot
Fido has chased Spot
Fido is chasing Spot
Fido will chase Spot
Spot will be chased
Fido will be chasing Spot
Fido will have been chasing Spot
Spot will have been being chased
*Spot will be had been chased
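
The single auxiliary rule derives these patterns because each auxiliary both bears a form (the f feature) and selects the form of the VP beneath it (the s feature). The chain is well-formed when each auxiliary's selection matches the form of the next verb, and the first verb is finite. A toy Python check, with feature values assumed for illustration rather than taken from the grammar's lexicon:

```python
# Each verb form bears a form feature; auxiliaries also select the
# form of their complement. Values here are illustrative assumptions.
FORM = {'will': 'fin', 'chases': 'fin', 'has': 'fin', 'is': 'fin',
        'have': 'base', 'be': 'base', 'chase': 'base',
        'been': 'en', 'chased': 'en',
        'being': 'ing', 'chasing': 'ing'}

SELECTS = {'will': {'base'},
           'have': {'en'}, 'has': {'en'},
           # 'be' is ambiguous: progressive (ing) or passive (en)
           'be': {'ing', 'en'}, 'been': {'ing', 'en'}, 'is': {'ing', 'en'},
           'being': {'en'}}

def good_sequence(words):
    # the sequence must start finite, and each auxiliary's selection
    # must match the form of the verb that follows it
    if FORM[words[0]] != 'fin':
        return False
    for aux, nxt in zip(words, words[1:]):
        if aux not in SELECTS or FORM[nxt] not in SELECTS[aux]:
            return False
    return True
```

On this sketch, 'will have been being chased' passes, while 'will been chased' fails because "will" selects a base form and "been" is a participle.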

Grammar 7 adds movement: yes-no questions, wh-questions, and relative clauses. Here are examples of its coverage:

what did you chase
which cat did you chase
*what did you bark
did you bark
the dog that Max chased
the dog that chased Max
these black dogs that Max chased

Numbers

One-digit numbers. These are simply the digits zero, one, ..., nine. Zero is not embeddable: we cannot say *twenty zero. Let us define the category digit to exclude zero: it consists of the embeddable digits.

Teens. These are the numbers ten, eleven, ..., nineteen. The category is teen.

Two-digit numbers. They begin with twenty: twenty, twenty one, ..., twenty nine, ..., ninety nine. Note that "zero" does not count as a digit here: a two-digit number consists of a tens word and a digit. Wherever an embedded two-digit number can be used, a teen or a digit can be used as well. So we define the category num2 to include two-digit numbers properly speaking, as well as teen and digit.

Hundreds. Examples: one hundred, a hundred, one hundred one, one hundred and one, ... one hundred ninety nine, ... nine hundred ninety nine, eleven hundred ninety nine, ... ninety nine hundred ninety nine. The example ten hundred does not really sound bad; perhaps it should not be excluded.

There is an alternation between "one" and "a," though "a" is not a digit. What follows "hundred" cannot be "a": *one hundred a. Also, when embedding a three-digit number, the form beginning with "a" cannot be used: *two thousand a hundred and six. Let us distinguish between embeddable and non-embeddable three-digit numbers. The non-embeddable case includes the embeddable case, but also forms beginning with "a hundred."

If a number greater than nine precedes the word "hundred," then the result is also not embeddable: we cannot say *six thousand thirteen hundred.

What follows the word "hundred" may be a digit, a teen, or a two-digit number: that is, the class num2. Between "hundred" and num2 there is an optional "and." Let us use the category tail3 for "hundred" followed by optional "and" followed by num2.

Let us call the embeddable case hundreds. The pattern is a digit followed by a tail3. The non-embeddable case is ne-hundreds, which consists of "a" or num2 followed by tail3. Note that the prefix cannot be omitted: *hundred six.

Let us use num3 for the union of hundreds and num2.
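
The categories up through num3 can be sketched as simple recognizers. This is a toy rendering of the classification above, not a full number grammar:

```python
# Recognizers for the number-name categories up through hundreds.
DIGITS = {'one', 'two', 'three', 'four', 'five',
          'six', 'seven', 'eight', 'nine'}
TEENS = {'ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
         'sixteen', 'seventeen', 'eighteen', 'nineteen'}
TENS = {'twenty', 'thirty', 'forty', 'fifty',
        'sixty', 'seventy', 'eighty', 'ninety'}

def is_num2(words):
    # num2 = digit | teen | tens [digit]; note that 'zero' is excluded
    if len(words) == 1:
        return words[0] in DIGITS or words[0] in TEENS or words[0] in TENS
    if len(words) == 2:
        return words[0] in TENS and words[1] in DIGITS
    return False

def is_tail3(words):
    # tail3 = 'hundred' [['and'] num2]
    if not words or words[0] != 'hundred':
        return False
    rest = words[1:]
    if not rest:
        return True
    if rest[0] == 'and':
        rest = rest[1:]
    return is_num2(rest)

def is_hundreds(words):
    # embeddable hundreds = digit tail3
    return len(words) >= 2 and words[0] in DIGITS and is_tail3(words[1:])

def is_ne_hundreds(words):
    # ne-hundreds = ('a' | num2) tail3; the prefix cannot be omitted
    for i in (1, 2):
        head, tail = words[:i], words[i:]
        if (head == ['a'] or is_num2(head)) and is_tail3(tail):
            return True
    return False
```

As the text requires, "thirteen hundred ninety nine" is accepted only as ne-hundreds, and "hundred six" is rejected outright.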

Thousands. Examples: a thousand, two thousand, thirteen thousand, ninety nine thousand, six thousand and four, six thousand three hundred, nine hundred ninety nine thousand nine hundred ninety nine. But not *six thousand thirteen hundred or *six thousand ninety nine hundred.

Again, "a" is not permissible when the number is embedded: *four million a thousand three. The pattern for thousands is num3 followed by tail4, where tail4 is the word "thousand" followed by an optional num3. There may be an "and" between "thousand" and the trailing num3.

The non-embeddable case is ne-thousands, which is "a" followed by tail4.

Define num5 to be the union of thousands and num3.

Millions and higher. For higher numbers, the pattern is repeated. Define tail6 to be "million" followed by an optional num5, with an optional "and" in between. Then millions is num3 followed by tail6, and ne-millions is "a" followed by tail6. The union of millions with num5 is num8.

For billions, the helping category is tail9, and the result is num11. For trillions they are tail12 and num14, and so on.

Numbers. The general case, num, is num14 or ne-trillions or ne-billions or ne-millions or ne-thousands or ne-hundreds or ne-digit (the non-embeddable digit, namely zero).

Translation to German

Example

The book known in English as Heidi by Johanna Spyri was originally published in German as two separate volumes: Heidis Lehr- und Wanderjahre (Gutenberg ebook 7500) and Heidi kann brauchen, was es gelernt hat (Gutenberg ebook 7512). This is the beginning of the first chapter of the first volume.

Vom freundlichen Dorfe Maienfeld führt ein Fußweg durch grüne, baumreiche Fluren bis zum Fuße der Höhen, die von dieser Seite groß und ernst auf das Tal herniederschauen. Wo der Fußweg anfängt, beginnt bald Heideland mit dem kurzen Gras und den kräftigen Bergkräutern dem Kommenden entgegenzuduften, denn der Fußweg geht steil und direkt zu den Alpen hinauf.

Before tackling the first sentence, we should be able to handle:

das freundliche Dorf
the friendly village
das Dorf ist freundlich
the village is friendly
vom freundlichen Dorfe
from the friendly village
das Dorf Maienfeld
the village Maienfeld
der Fußweg führt hoch
the path leads up
vom Dorf führt ein Fußweg hoch
a path leads up from the village
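
As a first baseline, one might try a word-by-word gloss. The pairs above already show where it breaks down: vom corresponds to two English words, freundliche and freundlichen both map to friendly, and word order differs in the last pair. A toy sketch with an invented mini-lexicon:

```python
# Naive word-by-word gloss (toy lexicon invented for illustration;
# real translation must handle agreement, contraction, and word order).
GLOSS = {'das': 'the', 'freundliche': 'friendly', 'Dorf': 'village',
         'ist': 'is', 'freundlich': 'friendly', 'Maienfeld': 'Maienfeld'}

def gloss(sentence):
    return ' '.join(GLOSS[w] for w in sentence.split())
```

This handles "das freundliche Dorf" but nothing involving vom or verb placement, which is precisely why grammar and morphology are needed.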

There are questions of judgment, such as whether to leave place names untranslated or to attempt to translate them (e.g., Mayfield for Maienfeld).

One of the things needed for German is morphology. For example, we want to map the category-semantics pair (det.m.dat.sg, ein) to einem and back again. The reverse mapping may be ambiguous.

German morphology

Determiners.

     Masc    Neut    Fem     Pl
Nom  dieser  dieses  diese   diese
Gen  dieses  dieses  dieser  dieser
Dat  diesem  diesem  dieser  diesen
Acc  diesen  dieses  diese   diese

This declension is also used for jener, mancher. The declension for the definite article differs in neuter singular nom/acc (das, not des) and fem/pl nom/acc (die, not de). The plural is used for viele, beide.
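
The dieser-type declension decomposes into stem plus ending, which makes both directions of the morphological mapping easy to sketch. The reverse direction also illustrates the ambiguity noted earlier (a sketch over the table above, not the seal morphology module):

```python
# dieser-type endings, keyed by (case, gender); pattern from the table.
ENDINGS = {
    ('nom', 'm'): 'er', ('nom', 'n'): 'es', ('nom', 'f'): 'e',  ('nom', 'pl'): 'e',
    ('gen', 'm'): 'es', ('gen', 'n'): 'es', ('gen', 'f'): 'er', ('gen', 'pl'): 'er',
    ('dat', 'm'): 'em', ('dat', 'n'): 'em', ('dat', 'f'): 'er', ('dat', 'pl'): 'en',
    ('acc', 'm'): 'en', ('acc', 'n'): 'es', ('acc', 'f'): 'e',  ('acc', 'pl'): 'e',
}

def decline(stem, case, gender):
    return stem + ENDINGS[(case, gender)]

def parse(form, stem='dies'):
    # all (case, gender) analyses consistent with the form;
    # the reverse mapping may return several
    return [cg for cg, ending in ENDINGS.items() if stem + ending == form]
```

Here decline('dies', 'dat', 'm') yields diesem, while parse('dieser') returns four analyses (nom masc, gen fem, gen pl, dat fem).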

Adjectives. Adjectives have weak and strong declensions: weak in das gute Bier, strong in gutes Bier. The strong declension:

     Masc   Neut   Fem    Pl
Nom  guter  gutes  gute   gute
Gen  guten  guten  guter  guter
Dat  gutem  gutem  guter  guten
Acc  guten  gutes  gute   gute

The weak declension:

     Masc       Neut       Fem        Pl
Nom  der gute   das gute   die gute   die guten
Gen  des guten  des guten  der guten  der guten
Dat  dem guten  dem guten  der guten  den guten
Acc  den guten  das gute   die gute   die guten

Internet Dictionary Project

The module seal.data.idp provides an interface to the Internet Dictionary Project (IDP) dictionaries. Dictionaries are available for French (fra), German (deu), Italian (ita), Latin (lat), Portuguese (por), and Spanish (spa). For all except Latin, the keys are English words and the values are translations into the target language. For Latin, the keys are Latin words.

Dictionaries are loaded on demand and cached. One may look up individual words as follows:

>>> from seal.data.idp import lookup
>>> lookup('animal', 'deu')
'Tier[Noun]'
>>> lookup('proprius', 'lat')
"one's own, permanent, special, peculiar."

Alternatively, one may fetch the entire dictionary (a dict) and access it directly:

>>> from seal.data.idp import lexicon
>>> latin = lexicon('lat')
>>> latin['amor']
'love, affection, infatuation, passion.'
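
Since lexicon() returns a plain dict, lookup() can plausibly be a thin wrapper over a cached loader. A sketch of the demand-loading behavior described above, in which _load_dictionary is a hypothetical stand-in for reading an IDP data file, not the actual seal implementation:

```python
# Demand loading with a cache: each language's dictionary is read
# at most once, then reused.
_cache = {}

def _load_dictionary(lang):
    # hypothetical stand-in: a real implementation would parse the
    # IDP file for the given language code
    return {'animal': 'Tier[Noun]'} if lang == 'deu' else {}

def lexicon(lang):
    if lang not in _cache:
        _cache[lang] = _load_dictionary(lang)
    return _cache[lang]

def lookup(word, lang):
    return lexicon(lang).get(word)
```

Repeated calls to lexicon('deu') return the same dict object, so direct access and lookup() see identical data.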

See IDP.