The module seal.data.census provides an interface to a list of names from the U.S. Census. A sample of about 6,000,000 census entries was selected, and the distribution of first and last names was computed. The actual sample of last names included only valid last-name entries, so it was somewhat smaller than the original sample of entries. The first-name sample was divided by gender, so that the male sample and female sample contain about 3,000,000 sample points each.
The basic function is get().
>>> from seal.data import census
>>> francis = census.get('Francis')
>>> francis
<Name FRANCIS mr=127 fr=393 lr=385>
The argument to get() is case-insensitive. If the argument is not found in the database, the return value is None.
>>> francis.male.freq
0.16
>>> francis.female.freq
0.039
>>> francis.last.freq
0.029
>>> francis.maleness()
0.8040201005025125
The value for francis.male or francis.female or francis.last is a census.Entry object, which has attributes string, freq, cumfreq, and rank.
The string is the same as n.string, where n is the Name object.
>>> francis.string
'FRANCIS'
The freq is in percent. It represents the percent of people in the sample whose name was the one given. Note that a last name with frequency of 1% corresponds to an absolute count of about 60,000 (out of 6,000,000), whereas a first name with frequency 1% corresponds to an absolute count of about 30,000 (out of 3,000,000).
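The conversion from a percentage to an approximate count can be sketched as follows. The sample sizes are the rounded figures quoted above (the last-name sample was in fact somewhat smaller), and the helper approx_count is hypothetical, not part of seal.data.census.

```python
# Approximate sample sizes quoted in the text (rounded, not exact counts).
SAMPLE_SIZE = {"last": 6_000_000, "male": 3_000_000, "female": 3_000_000}

def approx_count(freq_percent, sample):
    """Convert a frequency in percent to an approximate absolute count."""
    return round(freq_percent / 100 * SAMPLE_SIZE[sample])

print(approx_count(1.0, "last"))    # 1% of the last-name sample
print(approx_count(1.0, "male"))    # 1% of the male first-name sample
print(approx_count(0.029, "last"))  # FRANCIS as a last name
```

The same percentage therefore stands for roughly twice as many people in the last-name sample as in either first-name sample.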
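The cumfreq is the cumulative frequency, also in percent: the combined frequency of this name and all higher-ranked names in the same sample.
>>> francis.male.cumfreq
64.25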
The rank is an integer; the most-frequent name in the relevant sample has rank 1.
>>> francis.male.rank
127
The method maleness() returns the conditional probability that the name is male, given that it is a first name. That is, if m is the frequency in the male entry, and f is the frequency in the female entry, maleness is m/(m+f). (If the name never occurs as a first name, maleness defaults to 0.5.)
>>> jordan = census.get('jordan')
>>> jordan.maleness()
0.8235294117647058
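The maleness formula can be written out as a small standalone function. This is a sketch of the arithmetic only, with the frequencies passed in directly; it is not the library's own code.

```python
def maleness(m, f):
    """Return m / (m + f), or 0.5 when the name never occurs as a first name."""
    total = m + f
    if total == 0:
        return 0.5
    return m / total

print(maleness(0.16, 0.039))  # the FRANCIS frequencies: about 0.804
print(maleness(0.0, 0.0))     # no first-name occurrences: 0.5
```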
A name that occurs at all has values for all three entries. For example, "Morrison" occurs only as a last name:
>>> n = census.get('Morrison')
>>> n.male.freq
0.0
>>> n.female.freq
0.0
>>> n.last.freq
0.048
If the name occurs in none of the samples, census.get() returns None.
One can iterate over all names by calling the function names().
WordNet is a lexical database of word senses. A word is represented simply as a string. Each word has one or more senses:
>>> hound = word_senses('hound')
>>> hound
[Synset('hound.n.01'), Synset('cad.n.01'), Synset('hound.v.01')]
In WordNet, senses are called synsets. They are thought of as sets of synonyms, which is to say, words expressing a common sense. We can also go from senses to the words that express them:
>>> words_expressing(hound[0])
['hound', 'hound_dog']
>>> words_expressing(hound[1])
['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
>>> words_expressing(hound[2])
['hound', 'hunt', 'trace']
The senses of a word are implicitly grouped by part of speech. In our example, the noun senses of "hound" come first, and then the verb sense, and this is true in general in WordNet. One can get the group for a particular part of speech by providing the part of speech as a second argument to word_senses():
>>> word_senses('hound', 'v')
[Synset('hound.v.01')]
Senses never cross parts of speech; a given sense belongs to a single part of speech. For example, even if "christening" names the act of christening, its sense is taken to be distinct from that of the verb "christen." (To be precise, WordNet treats "christening" as ambiguous between a noun and gerund, the latter having part of speech v and synonymous with "christen.")
By convention, senses are numbered within part-of-speech groups, beginning at 1. The two nominal senses of "hound" are hound.n.01 and hound.n.02. One can use sense() to retrieve the synset corresponding to a designator, or sense name:
>>> sense('hound.n.01')
Synset('hound.n.01')
>>> sense('hound.n.02')
Synset('cad.n.01')
Each sense name identifies a unique synset, but there may be more than one name for a given synset, and the default name for a synset may differ from the one used to access it, as in the case of hound.n.02. In fact, there are as many names for a synset as there are words expressing the sense:
>>> words_expressing(hound[1])
['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
>>> sense_names(hound[1])
['cad.n.01', 'bounder.n.01', 'blackguard.n.01', 'dog.n.04', 'hound.n.02', 'heel.n.03']
One can confirm that the second nominal sense of "hound" and the fourth nominal sense of "dog" are one and the same:
>>> dog = word_senses('dog', 'n')
>>> dog[3] == hound[1]
True
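The many-to-one relation between sense names and synsets can be pictured as a simple index. The sketch below is a toy model built only from the names listed above. It identifies each synset by its default name, which is an assumption made for illustration and not how seal or WordNet stores the data.

```python
# Sense names of the second nominal sense of "hound", as listed above.
CAD_NAMES = ['cad.n.01', 'bounder.n.01', 'blackguard.n.01',
             'dog.n.04', 'hound.n.02', 'heel.n.03']

# Toy index: every sense name points at the synset's default (first) name.
index = {name: CAD_NAMES[0] for name in CAD_NAMES}
index['hound.n.01'] = 'hound.n.01'

def same_synset(name1, name2):
    """True if two sense names designate the same synset in the toy index."""
    return index[name1] == index[name2]

print(same_synset('hound.n.02', 'dog.n.04'))    # True
print(same_synset('hound.n.01', 'hound.n.02'))  # False
```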
Words expressing a sense are sorted in approximate order of decreasing frequency. The default name for a sense uses the most-common word that expresses it. However, there is no guarantee that the sense is the first sense of the word. The word "pointer" provides an example:
>>> pointer = word_senses('pointer')
>>> pointer
[Synset('arrow.n.01'), Synset('pointer.n.02'), Synset('cursor.n.01'), Synset('pointer.n.04')]
>>> words_expressing(pointer[1])
['pointer']
The second sense of "pointer" can only be expressed as "pointer," so necessarily the word chosen for the default sense name is "pointer." But it is not the dominant sense of the word "pointer."
The name "synset" suggests that a sense is uniquely determined by the set of words that express it, but that is not actually the case. For example, the two senses of "otter" are identical as sets of words, but they are nonetheless distinct senses:
>>> otter = word_senses('otter')
>>> words_expressing(otter[0])
['otter']
>>> words_expressing(otter[1])
['otter']
Their names do differ: the first is otter.n.01 and the second is otter.n.02. WordNet also provides definitions for senses, and in a case like this, the definition is the easiest way to determine the intended distinction.
>>> otter[0].definition
'the fur of an otter'
>>> otter[1].definition
'freshwater carnivorous mammal having webbed and clawed feet and dark brown fur'
Incidentally, the official unique identifier for a synset is not the sense name, but its byte offset in the file for the given part of speech. For example, the first sense of "otter" occurs at offset 14765785 of the noun file, and the second sense occurs at offset 2444819 of the same file:
>>> otter[0].pos
'n'
>>> otter[0].offset
14765785
>>> otter[1].pos
'n'
>>> otter[1].offset
2444819
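To see why the word set cannot serve as an identifier while the pair of part of speech and offset can, consider the following toy sketch. The Sense record is hypothetical; its field values are the ones shown in the session above.

```python
from collections import namedtuple

# Hypothetical record type, used only for this illustration.
Sense = namedtuple('Sense', ['name', 'pos', 'offset', 'words'])

otter = [
    Sense('otter.n.01', 'n', 14765785, ('otter',)),
    Sense('otter.n.02', 'n', 2444819, ('otter',)),
]

by_words = {frozenset(s.words): s for s in otter}  # word sets collide
by_ident = {(s.pos, s.offset): s for s in otter}   # (pos, offset) does not

print(len(by_words))                  # 1
print(len(by_ident))                  # 2
print(by_ident[('n', 2444819)].name)  # otter.n.02
```

The part of speech is part of the key because an offset is only meaningful within the file for that part of speech.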
WordNet defines a number of relations among word senses. The most common is the "is-a" relation. The parents of a sense are called its hypernyms(), and its children are called hyponyms(). For example:
>>> otter[0].hypernyms()
[Synset('fur.n.01')]
>>> otter[0].hyponyms()
[]
>>> otter[1].hypernyms()
[Synset('musteline_mammal.n.01')]
>>> otter[1].hyponyms()
[Synset('river_otter.n.01'), Synset('eurasian_otter.n.01')]
Most senses have a unique parent, though there is occasional multiple parentage.
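Because of multiple parentage, walking upward from a sense may fan out instead of following a single chain. A minimal sketch of collecting all ancestors is given below. The hierarchy is a made-up example with abstract labels, not WordNet data, and the function ancestors is hypothetical.

```python
def ancestors(sense, parents):
    """Collect every sense reachable by following hypernym links upward."""
    seen = []
    todo = list(parents.get(sense, []))
    while todo:
        s = todo.pop(0)
        if s not in seen:          # a shared ancestor is reported only once
            seen.append(s)
            todo.extend(parents.get(s, []))
    return seen

# Hypothetical hierarchy in which 'c' has two parents.
parents = {'c': ['a', 'b'], 'a': ['top'], 'b': ['top']}
print(ancestors('c', parents))  # ['a', 'b', 'top']
```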
Grammars 5, 6, and 7 represent a sequence of grammars covering additional phenomena. In each case, there are three files: for example, ex.g5, ex.lex5, and ex.text5.
Grammar 5 adds pronouns and names, noun modification, a richer set of subcategorization, including complements of adjectives, and subordinate clauses.
Root -> S;
Root -> NP;
S -> NP[n:$n] VP[f:$n];
NP[n:$n] -> Pron[n:$n];
NP[n:$n] -> Name[n:$n];
NP[n:$n] -> Det[n:$n] Nom[n:$n];
NP[n:pl] -> NP Conj NP;
Nom[n:$n] -> Adj1 Nom[n:$n];
Nom[n:$n] -> N[n:$n];
VP[f:$f] -> V[f:$f,t:n,s:null];
VP[f:$f] -> V[f:$f,t:y,s:null] NP;
VP[f:$f] -> V[f:$f,t:y,s:$p] NP PP[f:$p];
VP[f:$f] -> V[f:$f,t:y,s:np] NP NP;
VP[f:$f] -> V[f:$f,t:n,s:$p] PP[f:$p];
VP[f:$f] -> V[f:$f,t:n,s:adj] AdjP;
VP[f:$f] -> V[f:$f,t:y,s:$c] NP SC[f:$c];
VP[f:$f] -> V[f:$f,t:n,s:$c] SC[f:$c];
PP[f:$p] -> P[f:$p] NP;
AdjP -> Adj1[s:null];
AdjP -> Adj1[s:$p] PP[f:$p];
Adj1[s:$p] -> Deg Adj[s:$p];
Adj1[s:$p] -> Adj[s:$p];
SC[f:$c] -> C[f:$c] S;
SC[f:inf] -> P[f:to] VP[f:base];
Here are examples of coverage (see text5):
this dog
*this dogs
these dogs
the dog
the dogs
this dog barks
these dogs bark
*these dogs barks
*these dogs bark the cat
these dogs chase the cat
the black dog
Fido chases the cat
he thinks about the dog
she thinks that the dog chases the cat
*she thinks the dog
*she tells that the dog chases the cat
she tells the dog that the cat barks
the cat thinks
the cat wants to bark
Fido is black
the cat is happy about the toy
we gave the dog a toy
we gave a toy to the dog
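The role of the variables in a rule such as NP[n:$n] -> Det[n:$n] Nom[n:$n] can be sketched as follows: every occurrence of $n must receive the same value, which is what rules out *this dogs. This is an illustrative sketch, not the parser's actual implementation. The functions match and apply_rule are hypothetical, and the feature values given to the words are assumptions, since the lexicon file is not shown here.

```python
def match(pattern, features, bindings):
    """Match a feature pattern against actual features.

    Values starting with '$' are variables; all occurrences of a variable
    must receive the same value. Returns updated bindings, or None.
    """
    bindings = dict(bindings)
    for attr, want in pattern.items():
        have = features.get(attr)
        if want.startswith('$'):
            if bindings.setdefault(want, have) != have:
                return None
        elif want != have:
            return None
    return bindings

def apply_rule(lhs, rhs, children):
    """Return the parent's features, or None if the rule does not apply."""
    if len(rhs) != len(children):
        return None
    bindings = {}
    for (cat, pattern), (child_cat, feats) in zip(rhs, children):
        if cat != child_cat:
            return None
        bindings = match(pattern, feats, bindings)
        if bindings is None:
            return None
    return {attr: bindings.get(v, v) for attr, v in lhs[1].items()}

# NP[n:$n] -> Det[n:$n] Nom[n:$n]
NP_RULE = (('NP', {'n': '$n'}), [('Det', {'n': '$n'}), ('Nom', {'n': '$n'})])

this = ('Det', {'n': 'sg'})   # assumed lexical features
dog = ('Nom', {'n': 'sg'})
dogs = ('Nom', {'n': 'pl'})

print(apply_rule(*NP_RULE, [this, dog]))   # {'n': 'sg'}
print(apply_rule(*NP_RULE, [this, dogs]))  # None
```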
Grammar 6 adds only one rule to grammar 5:
VP[f:$f] -> Aux[f:$f,t:n,s:$v] VP[f:$v];
This provides coverage of auxiliary verb sequences in English. Here are examples (text6):
Fido chases Spot
Fido has chased Spot
Fido is chasing Spot
Fido will chase Spot
Spot will be chased
Fido will be chasing Spot
Fido will have been chasing Spot
Spot will have been being chased
*Spot will be had been chased
Grammar 7 adds movement: yes-no questions, wh-questions, and relative clauses. Here are examples of its coverage:
what did you chase
which cat did you chase
*what did you bark
did you bark
the dog that Max chased
the dog that chased Max
these black dogs that Max chased
One digit numbers. These are simply the digits zero, one, ..., nine. Zero is not embeddable: we cannot say *twenty zero. Let us define digit to exclude zero: it consists of the embeddable digits.
Teens. These are the numbers ten, eleven, ..., nineteen. The category is teen.
Two digit numbers. They begin with twenty: twenty, twenty one, ..., twenty nine, ... ninety nine. Note that "zero" does not count as a digit here: a two-digit number consists of a tens (twenty, thirty, ..., ninety), optionally followed by a digit. When embedded, wherever we can use a two-digit number, a teen or digit can also be used. So we define the category num2 to include two-digit numbers properly speaking, as well as teen and digit.
Hundreds. Examples: one hundred, a hundred, one hundred one, one hundred and one, ... one hundred ninety nine, ... nine hundred ninety nine, eleven hundred ninety nine, ... ninety nine hundred ninety nine. The example ten hundred does not really sound bad; perhaps it should not be excluded.
There is an alternation between "one" and "a," though "a" is not a digit. What follows "hundred" cannot be "a": *one hundred a. Also, when embedding a three-digit number, the form beginning with "a" cannot be used: *two thousand a hundred and six. Let us distinguish between embeddable and non-embeddable three-digit numbers. The non-embeddable case includes the embeddable case, but also ones beginning with "a hundred."
If a number greater than nine precedes the word "hundred," then the result is also not embeddable: we cannot say *six thousand thirteen hundred.
What follows the word "hundred" may be a digit, a teen, or a two-digit number: that is, the class num2. Between "hundred" and num2 there is an optional "and." Since nothing need follow "hundred" at all (one hundred), let us use the category tail3 for "hundred" optionally followed by num2, with an optional "and" in between.
Let us call the embeddable case hundreds. The pattern is a digit followed by a tail3. The non-embeddable case is ne-hundreds, which consists of "a" or num2 followed by tail3. Note that the prefix cannot be omitted: *hundred six.
Let us use num3 for the union of hundreds and num2.
Thousands. Examples: a thousand, two thousand, thirteen thousand, ninety nine thousand, six thousand and four, six thousand three hundred, nine hundred ninety nine thousand nine hundred ninety nine. But not *six thousand thirteen hundred or *six thousand ninety nine hundred.
Again, "a" is not permissible when the number is embedded: *four million a thousand three. The pattern for thousands is num3 followed by tail4, where tail4 is the word "thousand" followed by an optional num3. There may be an "and" between "thousand" and the trailing num3.
The non-embeddable case is ne-thousands, which is "a" followed by tail4.
Define num5 to be the union of thousands and num3.
Millions and higher. For higher numbers, the pattern is repeated. Define tail6 to be "million" followed by an optional num5, with an optional "and" in between. Then millions is num3 followed by tail6, and ne-millions is "a" followed by tail6. The union of millions with num5 is num8.
For billions, the helping category is tail9, and the result is num11. For trillions they are tail12 and num14, and so on.
Numbers. The general case, num, is num14 or ne-trillions or ne-billions or ne-millions or ne-thousands or ne-hundreds or ne-digit (zero, the one non-embeddable digit).
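The categories defined above can be sketched as regular expressions built up level by level. This is only one possible rendering, not the grammar file itself. It assumes words separated by single spaces, treats the material after "hundred" as optional (as in one hundred), and takes ne-digit to be just "zero". The helper names tail and is_number are hypothetical.

```python
import re

DIGIT = "(?:one|two|three|four|five|six|seven|eight|nine)"
TEEN = ("(?:ten|eleven|twelve|thirteen|fourteen|fifteen|sixteen|"
        "seventeen|eighteen|nineteen)")
TENS = "(?:twenty|thirty|forty|fifty|sixty|seventy|eighty|ninety)"

def tail(word, rest):
    """WORD, optionally followed by REST, with an optional "and" between."""
    return f"(?:{word}(?: (?:and )?{rest})?)"

NUM2 = f"(?:{TENS}(?: {DIGIT})?|{TEEN}|{DIGIT})"
TAIL3 = tail("hundred", NUM2)
NUM3 = f"(?:{DIGIT} {TAIL3}|{NUM2})"        # hundreds or num2
NE = [f"(?:(?:a|{NUM2}) {TAIL3})", "zero"]  # ne-hundreds, ne-digit (assumed)

num = NUM3
for word in ["thousand", "million", "billion", "trillion"]:
    t = tail(word, num)                     # tail4, tail6, tail9, tail12
    NE.append(f"(?:a {t})")                 # ne-thousands, ne-millions, ...
    num = f"(?:{NUM3} {t}|{num})"           # num5, num8, num11, num14

NUM = re.compile("|".join([num] + NE))

def is_number(text):
    """True if text is a well-formed number name under the sketch above."""
    return NUM.fullmatch(text) is not None

print(is_number("six thousand three hundred"))     # True
print(is_number("six thousand thirteen hundred"))  # False
print(is_number("a hundred and one"))              # True
```

Only the final num admits the non-embeddable forms, so a phrase such as "a thousand three" is accepted on its own but rejected after "four million".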
The book known in English as Heidi by Johanna Spyri was originally published in German as two separate volumes: Heidis Lehr- und Wanderjahre (Gutenberg ebooks 7500) and Heidi kann brauchen, was es gelernt hat (Gutenberg ebooks 7512). This is the beginning of the first chapter of the first volume.
Vom freundlichen Dorfe Maienfeld führt ein Fußweg durch grüne, baumreiche Fluren bis zum Fuße der Höhen, die von dieser Seite groß und ernst auf das Tal herniederschauen. Wo der Fußweg anfängt, beginnt bald Heideland mit dem kurzen Gras und den kräftigen Bergkräutern dem Kommenden entgegenzuduften, denn der Fußweg geht steil und direkt zu den Alpen hinauf.
Before tackling the first sentence, we should be able to handle:
das freundliche Dorf
the friendly village
das Dorf ist freundlich
the village is friendly
vom freundlichen Dorfe
from the friendly village
das Dorf Maienfeld
the village Maienfeld
der Fußweg führt hoch
the path leads up
vom Dorf führt ein Fußweg hoch
a path leads up from the village
There are judgment questions, such as whether to leave place names untranslated, or to attempt to translate them (e.g., Mayfield for Maienfeld).
One of the things that is needed for German is morphology. For example, we want to map the category-semantics pair (det.m.dat.sg, ein) to einem and back again. The reverse mapping may be ambiguous.
Determiners.
| | Masc | Neut | Fem | Pl |
|---|---|---|---|---|
| Nom | dieser | dieses | diese | diese |
| Gen | dieses | dieses | dieser | dieser |
| Dat | diesem | diesem | dieser | diesen |
| Acc | diesen | dieses | diese | diese |
This declension is also used for jener, mancher. The declension for the definite article differs in neuter singular nom/acc (das, not des) and fem/pl nom/acc (die, not de). The plural is used for viele, beide.
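The two directions of the morphological mapping can be sketched with the dieser paradigm from the table. Generation is a table lookup, and analysis is the inverse, which may return several readings. The function names and the (case, gender) labels are illustrative choices, not the category names used in the grammar.

```python
from collections import defaultdict

CASES = ['nom', 'gen', 'dat', 'acc']
GENDERS = ['m', 'n', 'f', 'pl']  # 'pl' stands in for the plural column

# Endings of the dieser declension, row by row as in the table above.
ENDINGS = [
    ['er', 'es', 'e', 'e'],    # nom
    ['es', 'es', 'er', 'er'],  # gen
    ['em', 'em', 'er', 'en'],  # dat
    ['en', 'es', 'e', 'e'],    # acc
]

def generate(stem, case, gender):
    """Map a stem plus (case, gender) to its surface form."""
    return stem + ENDINGS[CASES.index(case)][GENDERS.index(gender)]

def analyses(stem):
    """Invert generate(): map each surface form to all of its readings."""
    table = defaultdict(list)
    for case in CASES:
        for gender in GENDERS:
            table[generate(stem, case, gender)].append((case, gender))
    return dict(table)

print(generate('dies', 'dat', 'm'))  # diesem
print(analyses('dies')['dieser'])    # four readings
```

The form dieser is four ways ambiguous here (nominative masculine, genitive feminine, genitive plural, dative feminine), which illustrates why the reverse mapping cannot be a function.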
Adjectives. Adjectives have weak and strong declensions: weak in das gute Bier, strong in gutes Bier. The strong declension:
| | Masc | Neut | Fem | Pl |
|---|---|---|---|---|
| Nom | guter | gutes | gute | gute |
| Gen | guten | guten | guter | guter |
| Dat | gutem | gutem | guter | guten |
| Acc | guten | gutes | gute | gute |
The weak declension:
| | Masc | Neut | Fem | Pl |
|---|---|---|---|---|
| Nom | der gute | das gute | die gute | die guten |
| Gen | des guten | des guten | der guten | der guten |
| Dat | dem guten | dem guten | der guten | den guten |
| Acc | den guten | das gute | die gute | die guten |
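The choice between the two adjective declensions can be sketched as a lookup in one of two ending tables, both copied from the tables above. The function decline and its after_article flag are hypothetical simplifications: the flag merely records whether a definite article precedes, and other determiners are not handled.

```python
CASES = ['nom', 'gen', 'dat', 'acc']
GENDERS = ['m', 'n', 'f', 'pl']

STRONG = [['er', 'es', 'e', 'e'],    # nom
          ['en', 'en', 'er', 'er'],  # gen
          ['em', 'em', 'er', 'en'],  # dat
          ['en', 'es', 'e', 'e']]    # acc
WEAK = [['e', 'e', 'e', 'en'],       # nom
        ['en', 'en', 'en', 'en'],    # gen
        ['en', 'en', 'en', 'en'],    # dat
        ['en', 'e', 'e', 'en']]      # acc

def decline(stem, case, gender, after_article):
    """Inflect an adjective stem: weak after a definite article, else strong."""
    table = WEAK if after_article else STRONG
    return stem + table[CASES.index(case)][GENDERS.index(gender)]

print('das', decline('gut', 'nom', 'n', True), 'Bier')          # das gute Bier
print(decline('gut', 'nom', 'n', False), 'Bier')                # gutes Bier
print('vom', decline('freundlich', 'dat', 'n', True), 'Dorfe')  # vom freundlichen Dorfe
```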
The module seal.data.idp provides an interface to the Internet Dictionary Project (IDP) dictionaries. Dictionaries are available for French (fra), German (deu), Italian (ita), Latin (lat), Portuguese (por), and Spanish (spa). For all except Latin, the keys are English words and the target language translation is the value. For Latin, the keys are Latin words.
Dictionaries are loaded on demand and cached. One may look up individual words as follows:
>>> from seal.data.idp import lookup
>>> lookup('animal', 'deu')
'Tier[Noun]'
>>> lookup('proprius', 'lat')
"one's own, permanent, special, peculiar."
Alternatively, one may fetch the entire dictionary (a dict) and access it directly:
>>> from seal.data.idp import lexicon
>>> latin = lexicon('lat')
>>> latin['amor']
'love, affection, infatuation, passion.'
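The load-on-demand and caching behavior can be pictured with the following sketch. It is not the seal.data.idp implementation: the in-memory _FILES table stands in for the dictionary files, load_count exists only to make the caching visible, and returning None for a missing word is an assumption.

```python
from functools import lru_cache

# Stand-in for the dictionary files; entries are taken from the examples above.
_FILES = {
    'deu': {'animal': 'Tier[Noun]'},
    'lat': {'proprius': "one's own, permanent, special, peculiar.",
            'amor': 'love, affection, infatuation, passion.'},
}
load_count = 0

@lru_cache(maxsize=None)
def lexicon(lang):
    """Load the dictionary for lang on first request; later calls hit the cache."""
    global load_count
    load_count += 1
    return dict(_FILES[lang])  # a real loader would parse a file here

def lookup(word, lang):
    """Return the entry for word, or None if it is missing (assumed behavior)."""
    return lexicon(lang).get(word)

print(lookup('animal', 'deu'))  # Tier[Noun]
print(lookup('amor', 'lat'))
print(lookup('proprius', 'lat'))
print(load_count)               # 2: each language was loaded once
```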
See IDP.