This is a replacement for the previous version.
$ python -m seal.script.panlex2 COM ARG*
Some of the commands are actually multi-word commands, in particular, all the commands beginning with "compile."
The following variables must be set in ~/.seal:
Panlex is a relational database representing lexical information for the world's languages. The information is drawn typically from bilingual dictionaries. Accordingly, a dictionary is viewed as consisting of lexical entries ("meanings"), each of which is the pairing of an expression in the target language with an expression in the glossing language, such as:
boojoo[oji] hello[eng]
Generalizing, multiple target languages and multiple glossing languages are allowed. An example is a multilingual dictionary of several related languages, glossed in both English and French. Viewed this way, there is actually little need to distinguish between target language and glossing language: a lexical entry is simply a set of synonymous expressions in multiple languages.
Panlex includes some additional lexical information, such as parts of speech, properties, definitions, and semantic fields. Definitions and semantic fields are associated with lexical entries, but parts of speech and properties are permitted to differ between a word and its gloss. We should revise the previous example to:
boojoo[oji]/int hello[eng]/int
This lexical entry consists of two fields: boojoo[oji]/int and hello[eng]/int. A field is intrinsic to a lexical entry. Even if an apparently identical field occurs in a different lexical entry, Panlex treats it as a distinct object.
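The text[lang]/pos notation used in these examples is this document's own shorthand, not a Panlex file format. As an illustration only, a minimal Python sketch for reading it might look as follows; the names Field, parse_field, and parse_entry are hypothetical, and splitting on whitespace assumes single-word expressions.

```python
import re
from typing import NamedTuple, Optional

class Field(NamedTuple):
    text: str            # the expression string, e.g. "boojoo"
    lang: str            # three-letter language code, e.g. "oji"
    pos: Optional[str]   # part of speech, or None if absent

# text[lang], optionally followed by /pos
_FIELD_RE = re.compile(r'^(?P<text>.+)\[(?P<lang>[a-z]{3})\](?:/(?P<pos>\w+))?$')

def parse_field(s):
    """Parse one field written as text[lang] or text[lang]/pos."""
    m = _FIELD_RE.match(s.strip())
    if m is None:
        raise ValueError('not a field: %r' % s)
    return Field(m.group('text'), m.group('lang'), m.group('pos'))

def parse_entry(line):
    """Parse a lexical entry: whitespace-separated fields (single-word expressions only)."""
    return [parse_field(tok) for tok in line.split()]

entry = parse_entry('boojoo[oji]/int hello[eng]/int')
```

Each Field here is a plain value; to model the rule that fields in different lexical entries are distinct objects, one would attach an entry-specific ID to each.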
Hence, the main data types are as follows.
An expression is a piece of text that is explicitly labeled with the language it is written in, like "boojoo[oji]." An expression is represented in the database by an expression ID (exid). The ex table associates an exid with a string and language variety.
A field, which Panlex calls a "denotation," contains an expression, has a part of speech ("word class"), and may have properties. A field is represented by a field ID (fid). The expression and lexical entry for a given fid are specified in the dn table. The part of speech is given in the wc table. The list of properties is given in the table md.
A lexical entry, which Panlex calls a "meaning," is represented by a lexical-entry ID (lxid). I use the term lexical entry rather than meaning, because the object in question is dictionary-specific. No attempt is made to identify sameness of meaning across dictionaries. The association between lxid and dictionary is given in the mn table. An lxid may also be associated with a definition, in the df table, or with a semantic domain, in the dm table.
A dictionary, which Panlex calls a "source" or "approver," consists of a list of lexical entries, plus metadata. A dictionary is represented by a dictionary ID (did). The association between did and lxids is given in the table mn, and dictionary metadata is given in the table ap.
A language variety may be documented in multiple dictionaries, and a dictionary may document multiple language varieties. A language variety is represented by a language variety ID (lvid). The Panlex code for a language variety is of form abc-123, consisting of a three-letter iso code for the language and a three-digit variety code. The association between lvids and dids is given in the av table. The iso code and variety code are given in the lv table.
The data-type specifications used in the data tables are as follows. The most important are:
Supporting data types are as follows.
Expressions are used not only for words in dictionaries but also for parts of speech and dictionary names. An expression is a word in a particular language variety. It pairs a string with a language-variety ID.
ex | ||
---|---|---|
ex | exid | The expression. |
lv | lvid | Its language variety. |
tt | str | Its string. |
td | str | A "degraded text" version of the string. Contains only lowercase letters and digits. |
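The exact rule Panlex uses to compute degraded text is not documented here. The following is only a guess at one plausible rule, consistent with "only lowercase letters and digits": strip diacritics, lowercase, and drop everything that is not an ASCII letter or digit. The function name degrade is hypothetical.

```python
import unicodedata

def degrade(s):
    """Return a guess at a 'degraded' form of s (assumed rule, not Panlex's own)."""
    # Decompose accented characters, so that "é" becomes "e" plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', s)
    # Lowercase, then keep only ASCII letters and digits.
    return ''.join(c for c in decomposed.lower() if c.isascii() and c.isalnum())

examples = [degrade('Fort Hope'), degrade('Déjà vu 2')]
```

Under this assumed rule, strings in non-Latin scripts degrade to the empty string, so the real algorithm may well differ.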
A field belongs to a particular lexical entry, and its content is an expression.
dn | ||
---|---|---|
dn | fid | The field. |
mn | lxid | The lexical entry it belongs to. |
ex | exid | The contents. |
A part of speech may be assigned to a field.
wc | ||
---|---|---|
wc | num | An ID for the assignment? |
dn | fid | The field. |
ex | exid | The part of speech. |
The wcex table is a convenience listing of the expressions that are used as parts of speech.
wcex | ||
---|---|---|
ex | exid | The part-of-speech expression. |
tt | str | The part-of-speech string. |
A field may have properties (key-value pairs). These are used for declension classes, valency, etc.
md | ||
---|---|---|
md | num | An ID for the assignment? |
dn | fid | The field. |
vb | str | The key. |
vl | str | The value. |
A dictionary is a list of lexical entries. Panlex calls them "meanings."
mn | ||
---|---|---|
mn | lxid | The lexical entry. |
ap | did | The dictionary it belongs to. The table is sorted by this column. |
The df table appears to represent definitions or explanations. Not all dictionaries have them.
df | ||
---|---|---|
df | num | The definition ID (?) |
mn | lxid | The lexical entry. |
lv | lvid | The language variety of the definition text. |
tt | str | The definition text. |
The dm table appears to represent the semantic domain of an entry. Not all dictionaries include it.
dm | ||
---|---|---|
dm | num | The semantic domain (?) |
mn | lxid | The lexical entry. |
ex | exid | An expression naming the semantic domain |
An additional table, mi, also provides information about lexical entries. I have not been able to determine what it represents. The values in the tt field are usually IDs of some sort, but occasionally English words.
mi | ||
---|---|---|
mn | lxid | The lexical entry. |
tt | ? | ? |
A dictionary contains a list of lexical entries (see above). Metadata information is contained in the table ap.
ap | ||
---|---|---|
ap | did | The dictionary ID. |
dt | date | Registration date. |
tt | str | A short identifier, e.g. eng-ciw:Weshki. |
ur | url | The URL. |
bn | str | ISBN, perhaps? |
au | str | Author. |
ti | str | Title. |
pb | str | Publisher. |
yr | str | Year of publication. |
uq | num | Quality? |
ui | did | Appears to be the same as ap. |
ul | str | Some kind of summary line. |
li | lic2 | An IP license code. |
ip | str | An IP license statement. |
co | str | Company? |
ad | str | Email address |
A dictionary documents one or more language varieties.
av | ||
---|---|---|
ap | did | The dictionary. |
lv | lvid | A variety that it documents. |
The apli table appears to map 2-letter license codes to 3-letter codes. I don't know what the codes mean.
apli | ||
---|---|---|
id | num | ID for the assignment (?) |
li | lic2 | 2-letter code |
pl | ? | 3-letter code |
The table af appears to indicate the file format of the original source for the dictionary.
af | ||
---|---|---|
ap | did | The dictionary. |
fm | fm | The format. Example values are html, html-curl, pdf-lock/encrypt, txt, txt-wb, xml, pdf-img, and db. |
The fm table appears to contain information about "fm" codes.
fm | ||
---|---|---|
fm | fm | Format ID? |
tt | str | Dictionary name?? |
md | str | ? |
The table aped appears to contain Panlex processing information for dictionaries.
aped | ||
---|---|---|
ap | did | The dictionary. |
q | bool | ? |
cx | num | ? |
im | bool | ? |
re | bool | ? |
ed | ? | ? |
fp | ? | A code that seems to indicate the documented varieties and a one-word abbreviation of the title. E.g., eng-ciw-Weshki. |
etc | str | Appears to be comments about what work needs to be done yet. |
Languages are identified by three-letter ISO codes. A language variety is a specialization. The varieties of a given language are numbered from 0: eng0, eng1, etc. There is also a numeric ID for each language variety. For example, variety 187 is eng0.
lv | ||
---|---|---|
lv | lvid | The language variety. |
lc | iso | Its ISO language code. |
vc | vc | Language-variety sequence number. The varieties of a particular ISO-coded language are numbered sequentially from 0. |
sy | bool | ? |
am | bool | ? |
ex | exid | The name of the variety. Names are usually given in the variety itself (e.g., the name for German is given as "Deutsch"), but sometimes names are given in English. |
Additional information about language varieties is given in tables cp and cu. I don't know what these tables contain, possibly punctuation characters in the language.
cp | ||
---|---|---|
lv | lvid | A language variety. |
c0 | char | A code point. |
c1 | char | A code point. |
cu | ||
---|---|---|
lv | lvid | A language variety. |
c0 | char | A code point. |
c1 | char | A code point. |
loc | ? | ? |
vb | ? | Values include pun, priv, aux, cit:fin:pri, cit:kom:pri. |
One can examine the contents of the original zip file using the zip command. There are four subcommands:
A language is a set of varieties.
$ panlex variety deu
lv | lc | vc | sy | am | ex | ex.tt | ex.lv
157 | deu | 0 | t | t | 274 | Deutsch | 157
1349 | deu | 1 | t | t | 18586881 | Masematte | 1349
1845 | deu | 2 | t | t | 18586883 | Hessisch | 1845
9097 | deu | 3 | t | t | 12660638 | doitS | 9097
These are all the language varieties corresponding to ISO code "deu." Language variety 157 is deu0, variety 1349 is deu1, and so on. I don't know what "sy" and "am" are. The name of the variety is given in the variety itself. Specifically, an expression (ex) is the pairing of a string (ex.tt) with an indication of which variety it is written in (ex.lv).
To give another example, Ojibwe (oji) is a macrolanguage comprising Severn Ojibwa (ojs), Eastern Ojibwa (ojg), Central Ojibwa (ojc), Northwestern Ojibwa (ojb), Western Ojibwa (ojw), Chippewa (ciw), Ottawa (otw), and Algonquin (alq).
$ panlex variety oji ojs ojg ojc ojb ojw ciw otw alq
lv | lc | vc | sy | am | ex | ex.tt | ex.lv
30 | ojb | 0 | t | t | 18592962 | Anishinaabemowin | 30
536 | ciw | 0 | t | t | 18586345 | Anishinaabemowin | 536
934 | otw | 0 | t | t | 18593131 | Daawaamwin | 934
4069 | ojw | 0 | t | t | 18592975 | Nakaw?mowin | 4069
5598 | ojs | 1 | t | t | 7505858 | ????? | 5598
6930 | ojg | 0 | t | t | 18592966 | Nishnaabemwin | 6930
6931 | ojc | 0 | t | t | 18592964 | Ojibwe | 6931
6932 | ojs | 0 | t | t | 18592970 | Anishininiimowin | 6932
6933 | ciw | 1 | t | t | 8150 | Central Minnesota Chippewa | 187
7415 | ciw | 2 | t | t | 17070963 | Minnesota Ojibwe | 187
9170 | alq | 1 | t | t | 241072 | ???????? | 9170
19 | alq | 0 | t | t | 45808 | anicin?bemowin | 19
The question marks represent Unicode characters that LaTeX does not handle. The information here does not appear to be entirely correct. Panlex labels a wordlist that Margaret and Howard produced as documenting variety 536 (ciw0), which is Chippewa. I would have thought that they speak Eastern Ojibwa.
For each variety, there is a set of dictionaries.
$ panlex dicts 30 536 934 4069 5598 6930 6931 6932 6933 7415 9170 19
128 | Freelang Ojibwe-English dictionary | 13741 | eng-ciw-Weshki
153 | Freelang Ojibwe-English dictionary | 1319 | ciw-ojw-ojc-ojs-ojg-otw-mic-pot-eng-Weshki
611 | Astronomia Terminaro | 2474 | mul-Rapley
2409 | Swadesh Lists | 207 | art-mul-SL
2815 | Anishinaabemowin–English | 131 | ciw-eng-Noori
2830 | Ezhi-Giigidaang, How We Say It (Pronunciation) | 0 | ciw-eng-Kimewon
4091 | Lexique de la langue algonquine | 0 | alq-fra-Cuoq
3778 | Ojibwe Vocabulary Project | 0 | ciw-eng-Manidoons
3779 | Ojibwe-English Wordlist | 0 | ciw-eng-Weshki
4095 | Travels through the Canadas: Vocabulary of the Algonquin Tongue | 0 | alq-eng-Heriot
4144 | The Ojibwe People’s Dictionary | 0 | eng-ciw-OPD
A dictionary may document more than one variety.
To see information about a dictionary:
$ panlex dict 128
ap | lv
128 | 187
128 | 536
id 128
dt 2007-12-11
tt eng-ciw:Weshki
ur http://www.freelang.net/dictionary/ojibwe.php
bn
au Weshki-ayaad; Charles Lippert; Guy T. Gambill
ti Freelang Ojibwe-English dictionary
pb Freelang
yr 2010
uq 5
ui 128
ul TG 122; FreeLang.English_Ojibwe.wb
li co
ip Every author exercises rights with respect to the part of a list that represents that person’s own contribution.
co Guy T. Gambill
ad gambillgt1@yahoo.com
The first lines indicate which varieties the dictionary documents. In this case, they are 187 (English, eng0) and 536 (Chippewa, ciw0).
To find out which dictionaries document a particular pair of varieties:
$ panlex bidicts 187 536
128 | Freelang Ojibwe-English dictionary | 13741 | eng-ciw-Weshki
153 | Freelang Ojibwe-English dictionary | 1319 | ciw-ojw-ojc-ojs-ojg-otw-mic-pot-eng-Weshki
611 | Astronomia Terminaro | 2474 | mul-Rapley
2409 | Swadesh Lists | 207 | art-mul-SL
2830 | Ezhi-Giigidaang, How We Say It (Pronunciation) | 0 | ciw-eng-Kimewon
3778 | Ojibwe Vocabulary Project | 0 | ciw-eng-Manidoons
4144 | The Ojibwe People's Dictionary | 0 | eng-ciw-OPD
The columns are: dictionary ID (ap.ap), title (ap.ti), number of entries (count where mn.ap==ap), and short code (aped.fp).
To extract a bidict:
$ panlex bidict 128 536 187 | uniq > tmp.out
The result is ASCII sorted (case sensitive), in two-column format, with a single tab character as column separator. Let us think of the first column as the target language and the second column as the glossing language. If a target-language word has multiple glosses, they produce multiple lines in the file, all sharing the same target-language word. (Since the file is sorted, they form a contiguous block.) For example, the following occurs in the middle of tmp.out:
aabizh	cut seams open on
aabizhiishin	perk up
aabiziishin	come to
aabiziishin	revive
For some reason, the dictionaries sometimes contain duplicate entries - hence the "uniq" in the command line above.
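Because the file is sorted and tab-separated, the glosses for each word can be collected in a single pass. The following is a minimal sketch, not part of the panlex module; the function name group_glosses is hypothetical, and the duplicated sample line is added here only to show the deduplication that uniq otherwise performs.

```python
import itertools

def group_glosses(lines):
    """Map each target-language word to its glosses, dropping duplicates."""
    pairs = (line.rstrip('\n').split('\t', 1) for line in lines if line.strip())
    table = {}
    # The input is sorted, so all lines for one word form a contiguous block.
    for word, block in itertools.groupby(pairs, key=lambda p: p[0]):
        glosses = table.setdefault(word, [])
        for _, gloss in block:
            if gloss not in glosses:
                glosses.append(gloss)
    return table

sample = [
    'aabizh\tcut seams open on\n',
    'aabizhiishin\tperk up\n',
    'aabiziishin\tcome to\n',
    'aabiziishin\tcome to\n',   # duplicate line, as it would appear without uniq
    'aabiziishin\trevive\n',
]
glosses = group_glosses(sample)
```

With a real file, one would pass the open file object (for example open('tmp.out', encoding='utf-8')) in place of the sample list.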
f = open_zipfile()
The Panlex zip file is ~/src/cl/panlex-20140501-csv.zip.
Things you can do with a zip file:
f.namelist()        # list of filenames
f.printdir()        # print long listing
s = f.read(name)    # one of the names from namelist
The entire file is read as a single string.
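For readers on Python 3, the same operations are available through the standard zipfile module, with one difference: read() returns bytes rather than a string. The sketch below builds a tiny stand-in archive in memory so that it runs without the real dump; the member name panlex-demo-csv/af.csv and the UTF-8 encoding are assumptions.

```python
import io
import zipfile

# Build a small stand-in archive in memory (hypothetical member name and contents).
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr('panlex-demo-csv/af.csv', 'ap,fm\n1636,24\n')

with zipfile.ZipFile(buf) as f:
    names = f.namelist()        # list of member names
    raw = f.read(names[0])      # whole member as bytes in Python 3
    text = raw.decode('utf-8')  # assumed encoding
```

Reading a whole member at once is fine for small tables such as af, but the larger tables are better processed as a stream.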
The list of Panlex files:
>>> from panlex import open_zipfile
>>> f = open_zipfile()
>>> for nm in f.namelist():
...     print nm
...
panlex-20140501-csv/
panlex-20140501-csv/af.csv
panlex-20140501-csv/mi.csv
panlex-20140501-csv/aped.csv
panlex-20140501-csv/df.csv
panlex-20140501-csv/wc.csv
panlex-20140501-csv/av.csv
panlex-20140501-csv/lv.csv
panlex-20140501-csv/fm.csv
panlex-20140501-csv/ex.csv
panlex-20140501-csv/dm.csv
panlex-20140501-csv/cp.csv
panlex-20140501-csv/md.csv
panlex-20140501-csv/dn.csv
panlex-20140501-csv/cu.csv
panlex-20140501-csv/ap.csv
panlex-20140501-csv/wcex.csv
panlex-20140501-csv/mn.csv
panlex-20140501-csv/apli.csv
Raw contents.
s = raw_contents(fn)
The fn omits the directory name and the .csv suffix. That is, legitimate values are "af," "mi," etc.
Reader.
r = reader(fn)
Uses csv.reader to parse the csv format. The return value is an iterator over records, each record being a list of fields. The first record contains the field names.
>>> from panlex import reader
>>> r = reader('af')
>>> r.next()
['ap', 'fm']
>>> r.next()
['1636', '24']
Open file.
(hdr, recs) = open_file(fn)
The header is the list of field names, and recs is an iterator over the content records.
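The behavior of reader() and open_file() can be approximated in Python 3 by streaming a zip member through csv.reader. This is a sketch under assumptions, not the module's actual code: the functions take the open archive as an extra argument, PREFIX is a hypothetical directory name, and the files are assumed to be comma-separated UTF-8.

```python
import csv
import io
import zipfile

PREFIX = 'panlex-demo-csv'   # hypothetical directory name inside the archive

def reader(zf, fn):
    """Iterate over the records of table fn; the first record holds the field names."""
    member = zf.open('%s/%s.csv' % (PREFIX, fn))
    text = io.TextIOWrapper(member, encoding='utf-8', newline='')
    return csv.reader(text)

def open_file(zf, fn):
    """Return (hdr, recs): the field names and an iterator over the content records."""
    r = reader(zf, fn)
    hdr = next(r)
    return hdr, r

# Stand-in archive holding the first two records of af.csv.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as z:
    z.writestr(PREFIX + '/af.csv', 'ap,fm\n1636,24\n')

with zipfile.ZipFile(buf) as zf:
    hdr, recs = open_file(zf, 'af')
    rows = list(recs)   # consume while the archive is still open
```

Streaming matters for the large tables, which are too big to read comfortably as a single string.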
Print headers. Prints the database schema: the names and headers of all the files.
>>> from panlex import print_headers
>>> print_headers()
af: ap fm
mi: mn tt
aped: ap q cx im re ed fp etc
df: df mn lv tt
wc: wc dn ex
av: ap lv
lv: lv lc vc sy am ex
fm: fm tt md
ex: ex lv tt td
dm: dm mn ex
cp: lv c0 c1
md: md dn vb vl
dn: dn mn ex
cu: lv c0 c1 loc vb
ap: ap dt tt ur bn au ti pb yr uq ui ul li ip co ad
wcex: ex tt
mn: mn ap
apli: id li pl
Head and cat. The function head() prints the first n records. The function cat() dumps the contents readably. cat(fn,'html') produces HTML output.
Where. Select records containing specified values in a specified field. The return value is an iterator over records.
>>> from panlex import where
>>> for r in where('lv', 'lc', 'deu'):
...     print '|'.join(r)
...
157|deu|0|t|t|274
1349|deu|1|t|t|18586881
1845|deu|2|t|t|18586883
9097|deu|3|t|t|12660638
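The selection itself amounts to a filter on one column. A minimal sketch follows; unlike the module's where(), which takes a table name, this hypothetical version takes the header and records directly so that it can run on in-memory data. The eng record is truncated to the columns stated in this document.

```python
def where(hdr, recs, field, value):
    """Yield the records whose column `field` equals `value`."""
    i = hdr.index(field)
    return (r for r in recs if r[i] == value)

hdr = ['lv', 'lc', 'vc', 'sy', 'am', 'ex']
recs = [
    ['157', 'deu', '0', 't', 't', '274'],
    ['187', 'eng', '0'],                      # truncated record, for illustration only
    ['1349', 'deu', '1', 't', 't', '18586881'],
]
german = ['|'.join(r) for r in where(hdr, recs, 'lc', 'deu')]
```

All values are compared as strings, since csv.reader does not convert types.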
Expand expressions.
r = expand_expressions(recs, hdr)
Returns an iterator over records. Two new columns are added: the first contains the expression's string, and the second contains the expression's variety.
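As a rough sketch of the idea (not the module's implementation), the expansion is a lookup of each record's ex value in a table built from ex.csv. The third argument, ex_table, is a hypothetical addition that maps an exid to a (string, variety) pair.

```python
def expand_expressions(recs, hdr, ex_table):
    """Append the expression's string and variety to each record."""
    i = hdr.index('ex')
    for r in recs:
        # Unknown expression IDs get empty strings (an assumption of this sketch).
        tt, lv = ex_table.get(r[i], ('', ''))
        yield list(r) + [tt, lv]

hdr = ['lv', 'lc', 'vc', 'sy', 'am', 'ex']
recs = [['157', 'deu', '0', 't', 't', '274']]
ex_table = {'274': ('Deutsch', '157')}
expanded = list(expand_expressions(recs, hdr, ex_table))
```

The expanded record corresponds to the first line of output from "panlex variety deu", with columns ex.tt and ex.lv at the end.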
Dict entries. The function dict_entry_ids() returns an iterator over the entry IDs (lxids) for a given dictionary or dictionaries.
>>> from panlex import dict_entry_ids
>>> len(list(dict_entry_ids('128')))
13741
The function dict_entry_table() returns a table whose keys are meaning IDs and whose values are lists of pairs of the form (lvid, w), where w is a word string.
>>> from panlex import dict_entry_table
>>> ents = dict_entry_table('128')
>>> len(ents)
13741
>>> mns = list(ents)
>>> mns[0]
'2525999'
>>> ents[mns[0]]
[('187', 'consider'), ('536', 'naagadawaabam')]
>>> ents[mns[1]]
[('187', 'knock against'), ('536', 'bitaakoshkan')]
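Conceptually, the entry table is a join of the mn, dn, and ex tables. The sketch below shows that join on in-memory records; it is an illustration under assumptions, not the module's code. The record lists are passed in explicitly, and the field and expression IDs (d1, e1, and so on) and the second dictionary's entry are made up.

```python
def dict_entry_table(did, mn_recs, dn_recs, ex_recs):
    """Map each lxid of dictionary `did` to a list of (lvid, word) pairs."""
    # mn: (lxid, did) -> the lexical entries belonging to this dictionary
    lxids = {mn for mn, ap in mn_recs if ap == did}
    # ex: (exid, lvid, string, degraded) -> lookup by expression ID
    ex = {exid: (lv, tt) for exid, lv, tt, *_ in ex_recs}
    table = {}
    # dn: (fid, lxid, exid) -> attach each field's expression to its entry
    for fid, mn, exid in dn_recs:
        if mn in lxids and exid in ex:
            table.setdefault(mn, []).append(ex[exid])
    return table

mn_recs = [('2525999', '128'), ('9999999', '153')]
dn_recs = [('d1', '2525999', 'e1'), ('d2', '2525999', 'e2'), ('d3', '9999999', 'e3')]
ex_recs = [
    ('e1', '187', 'consider', 'consider'),
    ('e2', '536', 'naagadawaabam', 'naagadawaabam'),
    ('e3', '187', 'hello', 'hello'),
]
ents = dict_entry_table('128', mn_recs, dn_recs, ex_recs)
```

On the real data the ex table is large, so a practical version would first collect the needed exids from dn and keep only those expressions.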
Bilex pairs. The function bilex_pairs() returns an alphabetically sorted list of word pairs representing the entries of the given dictionary.
>>> from panlex import bilex_pairs
>>> pairs = bilex_pairs('128','536','187')
>>> pairs[0]
['Aabamadong', 'Fort Hope']
>>> len(pairs)
13739
Note that the pair of language IDs is not predictable from the dictionary. The dictionary may contain more than two languages, and even if it only contains two, the dictionary does not specify their order.
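Given an entry table of the kind returned by dict_entry_table(), the word pairs for a chosen ordered pair of varieties can be derived as in the following sketch. It is an illustration, not the module's code: this hypothetical version takes the entry table rather than a dictionary ID, and it removes duplicate pairs, which is an assumption about the intended behavior. The key 'lx2' is a made-up entry ID.

```python
def bilex_pairs(entries, lv1, lv2):
    """Return sorted [word1, word2] pairs, word1 in variety lv1 and word2 in lv2."""
    pairs = set()
    for fields in entries.values():
        words1 = [w for lv, w in fields if lv == lv1]
        words2 = [w for lv, w in fields if lv == lv2]
        # An entry with several words per variety contributes every combination.
        for w1 in words1:
            for w2 in words2:
                pairs.add((w1, w2))
    return [list(p) for p in sorted(pairs)]

entries = {
    '2525999': [('187', 'consider'), ('536', 'naagadawaabam')],
    'lx2': [('187', 'knock against'), ('536', 'bitaakoshkan')],
}
pairs = bilex_pairs(entries, '536', '187')
```

Swapping the last two arguments gives the pairs in the opposite direction, which is why the caller must choose the order.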
The database dump is contained in a zip file. The class ZipFile is used to access it.
>>> from seal.data.panlex import ZipFile
>>> zf = ZipFile()
Methods are provided for listing the contents of the zip file.
>>> zf.ls()
File Name                        Modified             Size
panlex-20140501-csv/             2014-05-01 03:02:18  0
panlex-20140501-csv/af.csv       2014-05-01 03:00:04  38522
panlex-20140501-csv/mi.csv       2014-05-01 03:02:00  33214449
...
>>> list(zf.filenames())
['af', 'mi', 'aped', 'df', 'wc', 'av', 'lv', 'fm', 'ex', ..., 'apli']
The method print_headers() prints out, for each table, its name and field names. It takes a minute or two to run.
>>> zf.print_headers()
af: ap fm
mi: mn tt
aped: ap q cx im re ed fp etc
...
To print the contents of the tables, the methods head and cat are provided.
>>> zf.head('wcex', 3)
ex | tt
3846607 | noun
3846608 | verb
>>> zf.cat('wcex')
ex | tt
3846607 | noun
3846608 | verb
3846609 | adjv
...
The method table returns a Table object containing the contents of the table. If the table contains an ex field, two new fields named ex.tt and ex.lv are added to each record. This method can be slow to run.
A Table is a collection of records. It has the following members and methods:
A Parser instance digests the information in the tables.
The value of compile is a Compiler instance. It is used to create digested files. If called with no arguments, it creates the files
The function attribute_entries() iterates over the records for a given subject type or a given subject-relation pair. For example:
>>> i = attribute_entries('expression', 'label')
>>> i.next()
(('expression', 'label', 'string'), '3990756', u'!')
The entries are of form (t, v_1, v_2), where t is of form (t_1, r, t_2).
Collect variety languages. The function collect_variety_languages() iterates over the variety-language records, and constructs a table indexed by variety ID (an int), whose value is the variety's language. E.g.:
>>> vlangs = collect_variety_languages()
>>> vlangs[187]
'eng'
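In outline, the table is a dictionary built from (variety, language) records. The sketch below assumes the records are available as sequences whose first two items are the variety ID and the language code; passing them in as an argument is a simplification of this example, not the module's signature.

```python
def collect_variety_languages(variety_language_recs):
    """Build a table from variety ID (int) to language code."""
    return {int(v): lang for v, lang, *_ in variety_language_recs}

# The two varieties below are the ones identified in this document.
vlangs = collect_variety_languages([('187', 'eng'), ('157', 'deu')])
```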
Collect approvers. The function collect_approvers() returns a table indexed by approver ID, in which the values are lists of form [lang, variety, quality, title].
Extracting bilexicons. A bilexicon is represented in Python by the class Bilex:
>>> b = Bilex('spa','eng')
Create raw. The first step is to create the raw bilexicon:
>>> b.create_raw()
This takes about 25 minutes to run. The output (in this example) is the file spa-eng-raw.txt in the directory /cl/data/panlex/lex.
The create_raw() method starts by loading the variety-language table, which maps varieties to their languages.
Then it goes through the expression-variety records, creating a table of expressions. The keys are expressions (ints) and the values are lists of form [variety, label, degraded text]. An entry is created only for expressions whose variety's language is one of the two languages of interest. Label and degraded text are initially set to the empty string.
Next it goes through the expression-label and expression-degraded-text records, filling in the other fields of the expression entries.
Next it creates a denotations table. It goes through the denotation-expression records. If the expression has an entry in the expressions table, then a new entry is created in the denotations table. The key is the denotation (an int), and the value is a list of form [expression, part of speech, meaning]. Initially only the expression is set. Part of speech is initialized to the empty string and meaning is initialized to 0.
Next it goes through the denotation-pos records and the denotation-meaning records, filling in the remaining fields in the denotation entries.
By that point, memory is pretty much full. Output is written to lang1-lang2-raw.txt. We pass through the denotations table. Each denotation entry contains an expression ID; we use it to fetch the expression entry. The expression entry contains a variety ID; we use it to look up the language. Each denotation generates one line of output, of form:
m lang v expr degraded pos d e
The single letters represent integer IDs: meaning (m), variety (v), denotation (d), expression (e). The denotation and expression IDs are included only for debugging purposes.
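The steps of create_raw() described above can be condensed into the following sketch. It is a simplified illustration that works on small in-memory record lists rather than the real attribute files, and it yields tuples instead of writing a file. The function signature is hypothetical, and 666 is a made-up variety ID for Spanish.

```python
def create_raw(langs, vlangs, ex_variety, ex_label, ex_degraded,
               dn_ex, dn_pos, dn_meaning):
    """Yield (m, lang, v, expr, degraded, pos, d, e) for each kept denotation."""
    # Expressions table: e -> [variety, label, degraded text]
    exprs = {}
    for e, v in ex_variety:
        if vlangs.get(v) in langs:          # keep only the two languages of interest
            exprs[e] = [v, '', '']
    for e, label in ex_label:
        if e in exprs:
            exprs[e][1] = label
    for e, td in ex_degraded:
        if e in exprs:
            exprs[e][2] = td
    # Denotations table: d -> [expression, part of speech, meaning]
    dens = {}
    for d, e in dn_ex:
        if e in exprs:
            dens[d] = [e, '', 0]
    for d, pos in dn_pos:
        if d in dens:
            dens[d][1] = pos
    for d, m in dn_meaning:
        if d in dens:
            dens[d][2] = m
    # One output record per denotation.
    for d, (e, pos, m) in dens.items():
        v, label, td = exprs[e]
        yield (m, vlangs[v], v, label, td, pos, d, e)

vlangs = {187: 'eng', 666: 'spa', 157: 'deu'}   # 666 is a hypothetical variety ID
rows = list(create_raw(
    {'spa', 'eng'}, vlangs,
    ex_variety=[(1, 187), (2, 666), (3, 157)],
    ex_label=[(1, 'house'), (2, 'casa'), (3, 'Haus')],
    ex_degraded=[(1, 'house'), (2, 'casa'), (3, 'haus')],
    dn_ex=[(10, 1), (11, 2), (12, 3)],
    dn_pos=[(10, 'noun'), (11, 'noun')],
    dn_meaning=[(10, 100), (11, 100), (12, 100)],
))
```

The German expression is dropped at the first step, so its denotation never enters the denotations table.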
Sort raw. The method sort_raw() calls Unix sort to sort the raw file by meaning, language, variety, and label. The output is written to lang1-lang2-m1.txt. It takes a couple minutes to run.
Create m2. The method create_m2() adds approvers, and also filters out monolingual meanings. (I tried adding approvers when creating the raw file, but Python runs out of memory.)
>>> b.create_m2()
The method scans through the m1.txt file, collecting a table of meanings. For each block of meanings, note is kept of whether both languages are seen. If so, an entry is created in the meanings table, and otherwise no entry is created. The meanings table is indexed by meaning ID, and the value is the approver ID (initialized to 0).
After creating the meanings table, the method passes through the meaning-approver records and sets the values (approvers) for the meanings.
Next it calls collect_approvers() to get the quality information for each approver.
Finally, it passes a second time through the m1.txt file. Each time it encounters a new meaning, it looks in the meanings table to see whether it should be kept or not. If the meaning is a keeper, the quality of the approver is looked up in the approvers table. Each line from m1.txt that is to be kept is copied to m2.txt, and two new fields are added at the end: approver ID and quality. Hence the lines in m2.txt are of form:
m lang v expr degraded pos d e a q
where "a" is approver and "q" is quality (both are ints).
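The two-pass logic of create_m2() can be sketched as follows. This is an illustration on in-memory tuples, not the method itself: the signature is hypothetical, and the first pass uses a dictionary of language sets instead of relying on the sorted block structure of the m1 file.

```python
def create_m2(m1_rows, lang1, lang2, meaning_approver, approver_quality):
    """Yield m1 rows for bilingual meanings, extended with approver and quality."""
    # Pass 1: record which languages occur under each meaning.
    seen = {}
    for row in m1_rows:
        seen.setdefault(row[0], set()).add(row[1])
    # Meanings table: m -> approver ID (initialized to 0), bilingual meanings only.
    meanings = {m: 0 for m, langs in seen.items()
                if lang1 in langs and lang2 in langs}
    for m, a in meaning_approver:
        if m in meanings:
            meanings[m] = a
    # Pass 2: copy the kept rows, adding approver ID and quality.
    for row in m1_rows:
        m = row[0]
        if m in meanings:
            a = meanings[m]
            yield tuple(row) + (a, approver_quality.get(a, 0))

m1_rows = [
    (100, 'eng', 187, 'house', 'house', 'noun', 10, 1),
    (100, 'spa', 666, 'casa', 'casa', 'noun', 11, 2),   # 666: hypothetical variety ID
    (101, 'eng', 187, 'dog', 'dog', 'noun', 12, 3),     # monolingual meaning
]
m2_rows = list(create_m2(m1_rows, 'spa', 'eng',
                         meaning_approver=[(100, 7), (101, 8)],
                         approver_quality={7: 5, 8: 3}))
```

Because m1_rows is traversed twice, it must be a list (or the file must be reopened) rather than a one-shot iterator.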
Create sources. The method create_sources() extracts detailed information about each of the approvers. It writes the file lang1-lang2-sources.txt. The line format is:
a rel value
where "a" is the approver ID. The relations (attributes) are: lang, variety, regdate, label, creator, isbn, lic_id, license, year, publ, title, and url. An empty line is inserted before each block of records sharing a common value for "a."
By word. The method by_word() creates a file containing lines of form
word-lang1 quality word-lang2
The method sort_by_word() then sorts that file.
It turns out that the quality scores for the approvers are not very informative about whether the entries are actually good. For example, the top quality source (quality 7) for the Spanish word "a" includes meanings "crazy," "missionary," and "physical" - completely bogus. A much better gauge appears to be the number of sources in which the translation occurs.
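Counting sources per translation could be sketched as below. This is not an existing method of Bilex; the function name source_counts, the simplified row format (meaning, language, expression, approver), and the sample rows are all made up for illustration.

```python
from collections import defaultdict

def source_counts(rows, lang1, lang2):
    """Count, for each (word1, word2) pair, the distinct approvers attesting it."""
    # Group the words of each meaning by language, with the meaning's approvers.
    meanings = {}
    for m, lang, expr, a in rows:
        words1, words2, approvers = meanings.setdefault(m, ([], [], set()))
        if lang == lang1:
            words1.append(expr)
        elif lang == lang2:
            words2.append(expr)
        approvers.add(a)
    # Every cross-language pair within a meaning is attested by its approvers.
    sources = defaultdict(set)
    for words1, words2, approvers in meanings.values():
        for w1 in words1:
            for w2 in words2:
                sources[(w1, w2)] |= approvers
    return {pair: len(s) for pair, s in sources.items()}

rows = [
    (1, 'spa', 'a', 10), (1, 'eng', 'to', 10),
    (2, 'spa', 'a', 11), (2, 'eng', 'to', 11),
    (3, 'spa', 'a', 12), (3, 'eng', 'crazy', 12),
]
counts = source_counts(rows, 'spa', 'eng')
```

In this made-up sample the translation attested by two sources outranks the one attested by a single source, regardless of the sources' quality scores.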