Panlex

Panlex2

This is a replacement for the previous version.

Usage

$ python -m seal.script.panlex2 COM ARG*

Some of the commands are actually multi-word commands, in particular, all the commands beginning with "compile."

lang CODE
CODE is an ISO 639-3 language code. Prints out information about all varieties of the language with the given code. The printout includes the LVID code for each variety and the list of dictionaries (SIDs) for each variety.
lvid LVID
Produces the same output as lang, but limited to a single variety.
dict SID
Prints out metadata for the dictionary whose "source ID" is SID.
compile varieties
Writes varieties.tab. Needed for e.g. compile bilex.
compile bilex TGT [GLS]
TGT and GLS must be language variety IDs. If GLS is not given, it defaults to 187 (English). Writes the file bilex-TLVID-GLVID.tab, which contains records of the form tgt_str gloss_str sids, where sids is a space-separated list of source IDs.
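A minimal sketch of reading a bilex file, assuming the three fields are separated by tabs (the .tab suffix suggests this, but it is not stated above). The function name parse_bilex and the sample records are hypothetical.

```python
import io

def parse_bilex(lines):
    """Yield (tgt_str, gloss_str, sids) tuples from bilex records.

    Assumes tab-separated fields; sids is a space-separated list.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        tgt, gloss, sids = line.split("\t")
        yield tgt, gloss, sids.split()

# Made-up sample standing in for a bilex-TLVID-GLVID.tab file.
sample = io.StringIO("boojoo\thello\t128 153\naabiziishin\trevive\t128\n")
records = list(parse_bilex(sample))
```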

Environment

The following variables must be set in ~/.seal:

data.panlex.zipfn
The pathname of the Panlex zip file. It may begin with "~".
data.panlex.dirname
The toplevel directory in the zip file, e.g. "panlex-20190901-csv".
data.panlex.tgtdir
The directory in which to install compiled dictionaries, etc. It may begin with "~".

Overview

Panlex is a relational database representing lexical information for the world's languages. The information is drawn typically from bilingual dictionaries. Accordingly, a dictionary is viewed as consisting of lexical entries ("meanings"), each of which is the pairing of an expression in the target language with an expression in the glossing language, such as:

boojoo[oji] hello[eng]

Generalizing, multiple target languages and multiple glossing languages are allowed. An example is a multilingual dictionary of several related languages, glossed in both English and French. Viewed this way, there is actually little need to distinguish between target language and glossing language: a lexical entry is simply a set of synonymous expressions in multiple languages.

Panlex includes some additional lexical information, such as parts of speech, properties, definitions, and semantic fields. Definitions and semantic fields are associated with lexical entries, but parts of speech and properties are permitted to differ between a word and its gloss. We should revise the previous example to:

boojoo[oji]/int hello[eng]/int

This lexical entry consists of two fields: boojoo[oji]/int and hello[eng]/int. A field is intrinsic to a lexical entry. Even if an apparently identical field occurs in a different lexical entry, Panlex treats it as a distinct object.
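The distinction between expressions (shared) and fields (private to an entry) can be sketched with a few Python classes. This is only an illustration of the model described above, not the representation used by the module. The class names and the second entry (bonjour[fra]) are made up for the example.

```python
from dataclasses import dataclass, field as dc_field
from typing import List, Optional

@dataclass(frozen=True)
class Expression:
    string: str
    variety: str   # e.g. "oji"; the real tables use numeric variety IDs

@dataclass(eq=False)   # no generated __eq__: fields compare by identity
class Field:
    expression: Expression
    pos: Optional[str] = None

@dataclass(eq=False)
class LexicalEntry:
    fields: List[Field] = dc_field(default_factory=list)

hello = Expression("hello", "eng")
entry1 = LexicalEntry([Field(Expression("boojoo", "oji"), "int"), Field(hello, "int")])
entry2 = LexicalEntry([Field(Expression("bonjour", "fra"), "int"), Field(hello, "int")])

f1, f2 = entry1.fields[1], entry2.fields[1]
same_expression = f1.expression == f2.expression   # the expression is shared
same_field = f1 == f2                              # the fields are distinct objects
```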

Hence, the main data types are as follows.

Data tables

Data types

The data-type specifications used in the data tables are as follows. The most important are:

Supporting data types are as follows.

Expressions

Expressions are used not only for words in dictionaries but also for parts of speech and dictionary names. An expression is a word in a particular language variety. It pairs a string with a language-variety ID.

ex
ex exid The expression.
lv lvid Its language variety.
tt str Its string.
td str A "degraded text" version of the string. Contains only lowercase letters and digits.

Fields

A field belongs to a particular lexical entry, and its content is an expression.

dn
dn fid The field.
mn lxid The lexical entry it belongs to.
ex exid The contents.

A part of speech may be assigned to a field.

wc
wc num An ID for the assignment?
dn fid The field.
ex exid The part of speech.

The wcex table is a convenience listing of the expressions that are used as parts of speech.

wcex
ex exid The part-of-speech expression.
tt str The part-of-speech string.

A field may have properties (key-value pairs). These are used for declension classes, valency, etc.

md
md num An ID for the assignment?
dn fid The field.
vb str The key.
vl str The value.

Lexical entries

A dictionary is a list of lexical entries. Panlex calls them "meanings."

mn
mn lxid The lexical entry.
ap did The dictionary it belongs to. The table is sorted by this column.
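The tables mn, dn, and ex together determine the contents of a dictionary. A minimal sketch of the join, using tiny made-up rows in the column layouts given above (the IDs are invented, and the function name is hypothetical):

```python
from collections import defaultdict

# Made-up rows following the documented column layouts.
ex = {"1": ("536", "boojoo"), "2": ("187", "hello")}   # exid -> (lv, tt)
dn = [("10", "100", "1"), ("11", "100", "2")]          # (dn, mn, ex)
mn = [("100", "128")]                                  # (mn, ap)

def entries_for_dictionary(did):
    """Map each lexical entry of dictionary `did` to its (lvid, string) pairs."""
    wanted = {lxid for lxid, ap in mn if ap == did}
    table = defaultdict(list)
    for _fid, lxid, exid in dn:
        if lxid in wanted:
            table[lxid].append(ex[exid])
    return dict(table)

entries = entries_for_dictionary("128")
```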

The df table appears to represent definitions or explanations. Not all dictionaries have them.

df
df num The definition ID (?)
mn lxid The lexical entry.
lv lvid The language variety of the definition text.
tt str The definition text.

The dm table appears to represent the semantic domain of an entry. Not all dictionaries include it.

dm
dm num The semantic domain (?)
mn lxid The lexical entry.
ex exid An expression naming the semantic domain.

An additional table, mi, also provides information about lexical entries. I have not been able to determine what it represents. The values in the tt field are usually IDs of some sort, but occasionally English words.

mi
mn lxid The lexical entry.
tt ? ?

Dictionaries

A dictionary contains a list of lexical entries (see above). Metadata information is contained in the table ap.

ap
ap did The dictionary ID.
dt date Registration date.
tt str A short identifier, e.g. eng-ciw:Weshki.
ur url The URL.
bn str ISBN, perhaps?
au str Author.
ti str Title.
pb str Publisher.
yr str Year of publication.
uq num Quality?
ui did Appears to be the same as ap.
ul str Some kind of summary line.
li lic2 An IP license code.
ip str An IP license statement.
co str Company?
ad str Email address.

A dictionary documents one or more language varieties.

av
ap did The dictionary.
lv lvid A variety that it documents.

The apli table appears to map 2-letter license codes to 3-letter codes. I don't know what the codes mean.

apli
id num ID for the assignment (?)
li lic2 2-letter code
pl ? 3-letter code

The table af appears to indicate the file format of the original source for the dictionary.

af
ap did The dictionary.
fm fm The format. Example values are html, html-curl, pdf-lock/encrypt, txt, txt-wb, xml, pdf-img, and db.

The fm table appears to contain information about "fm" codes.

fm
fm fm Format ID?
tt str Dictionary name??
md str ?

The table aped appears to contain Panlex processing information for dictionaries.

aped
ap did The dictionary.
q bool ?
cx num ?
im bool ?
re bool ?
ed ? ?
fp ? A code that seems to indicate the documented varieties and a one-word abbreviation of the title. E.g., eng-ciw-Weshki.
etc str Appears to be comments about what work needs to be done yet.

Language varieties

Languages are identified by three-letter ISO 639-3 codes. A language variety is a specialization of a language. The varieties of a given language are numbered from 0: eng0, eng1, etc. There is also a numeric ID for each language variety. For example, variety 187 is eng0.

lv lvid The language variety.
lc iso Its ISO language code.
vc vc Language-variety sequence number. The varieties of a particular ISO-coded language are numbered sequentially from 0.
sy bool ?
am bool ?
ex exid The name of the variety. Names are usually given in the variety itself (e.g., the name for German is given as "Deutsch"), but sometimes names are given in English.
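The relationship between numeric variety IDs and codes such as eng0 can be sketched as follows. The rows are the (lv, lc, vc) values quoted in this document; the function name is hypothetical.

```python
def variety_code(lc, vc):
    """Build a code such as 'eng0' from an ISO code and a sequence number."""
    return f"{lc}{vc}"

# (lv, lc, vc) triples taken from the examples in this document.
lv_rows = [("187", "eng", "0"), ("157", "deu", "0"), ("1349", "deu", "1")]
codes = {lv: variety_code(lc, vc) for lv, lc, vc in lv_rows}
```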

Additional information about language varieties is given in tables cp and cu. I don't know what these tables contain; possibly they list punctuation characters in the language.

cp
lv lvid A language variety.
c0 char A code point.
c1 char A code point.
cu
lv lvid A language variety.
c0 char A code point.
c1 char A code point.
loc ? ?
vb ? Values include pun, priv, aux, cit:fin:pri, cit:kom:pri.

Panlex executable

Zip

One can examine the contents of the original zip file using the zip command. There are four subcommands:

list
List the filenames.
head f
Print the first 50 records of file f.
cat f
Print all the records of file f.
table f
Like cat, except that, if the file has a field labeled ex, two new columns are added: ex.tt and ex.lv. The former contains the string of the expression and the latter contains the language-variety ID of the expression. One may optionally provide an attribute a and value v to restrict the listing to records that have value v for attribute a. Nota bene: this command is generally much slower than cat.

Variety

A language is a set of varieties.

$ panlex variety deu
lv | lc | vc | sy | am | ex | ex.tt | ex.lv
157 | deu | 0 | t | t | 274 | Deutsch | 157
1349 | deu | 1 | t | t | 18586881 | Masematte | 1349
1845 | deu | 2 | t | t | 18586883 | Hessisch | 1845
9097 | deu | 3 | t | t | 12660638 | doitS | 9097

These are all the language varieties corresponding to ISO code "deu." Language variety 157 is deu0, variety 1349 is deu1, and so on. I don't know what "sy" and "am" are. The name of the variety is given in the variety itself. Specifically, an expression (ex) is the pairing of a string (ex.tt) with an indication of which variety it is written in (ex.lv).

To give another example, Ojibwe (oji) is a macrolanguage comprising Severn Ojibwa (ojs), Eastern Ojibwa (ojg), Central Ojibwa (ojc), Northwestern Ojibwa (ojb), Western Ojibwa (ojw), Chippewa (ciw), Ottawa (otw), and Algonquin (alq).

$ panlex variety oji ojs ojg ojc ojb ojw ciw otw alq
lv | lc | vc | sy | am | ex | ex.tt | ex.lv
30 | ojb | 0 | t | t | 18592962 | Anishinaabemowin | 30
536 | ciw | 0 | t | t | 18586345 | Anishinaabemowin | 536
934 | otw | 0 | t | t | 18593131 | Daawaamwin | 934
4069 | ojw | 0 | t | t | 18592975 | Nakaw?mowin | 4069
5598 | ojs | 1 | t | t | 7505858 | ????? | 5598
6930 | ojg | 0 | t | t | 18592966 | Nishnaabemwin | 6930
6931 | ojc | 0 | t | t | 18592964 | Ojibwe | 6931
6932 | ojs | 0 | t | t | 18592970 | Anishininiimowin | 6932
6933 | ciw | 1 | t | t | 8150 | Central Minnesota Chippewa | 187
7415 | ciw | 2 | t | t | 17070963 | Minnesota Ojibwe | 187
9170 | alq | 1 | t | t | 241072 | ???????? | 9170
19 | alq | 0 | t | t | 45808 | anicin?bemowin | 19

The question marks represent Unicode characters that LaTeX does not handle. The information here does not appear to be entirely correct. Panlex labels a wordlist that Margaret and Howard produced as documenting variety 536 (ciw0), which is Chippewa. I would have thought that they speak Eastern Ojibwa.

Dicts

For each variety, there is a set of dictionaries.

$ panlex dicts 30 536 934 4069 5598 6930 6931 6932 6933 7415 9170 19
128 | Freelang Ojibwe-English dictionary | 13741 | eng-ciw-Weshki
153 | Freelang Ojibwe-English dictionary | 1319 | ciw-ojw-ojc-ojs-ojg-otw-mic-pot-eng-Weshki
611 | Astronomia Terminaro | 2474 | mul-Rapley
2409 | Swadesh Lists | 207 | art-mul-SL
2815 | Anishinaabemowin–English | 131 | ciw-eng-Noori
2830 | Ezhi-Giigidaang, How We Say It (Pronunciation) | 0 | ciw-eng-Kimewon
4091 | Lexique de la langue algonquine | 0 | alq-fra-Cuoq
3778 | Ojibwe Vocabulary Project | 0 | ciw-eng-Manidoons
3779 | Ojibwe-English Wordlist | 0 | ciw-eng-Weshki
4095 | Travels through the Canadas: Vocabulary of the Algonquin Tongue | 0 | alq-eng-Heriot
4144 | The Ojibwe People’s Dictionary | 0 | eng-ciw-OPD

A dictionary may document more than one variety.

Dict

To see information about a dictionary:

$ panlex dict 128
ap | lv
128 | 187
128 | 536

id 128
dt 2007-12-11
tt eng-ciw:Weshki
ur http://www.freelang.net/dictionary/ojibwe.php
bn
au Weshki-ayaad; Charles Lippert; Guy T. Gambill
ti Freelang Ojibwe-English dictionary
pb Freelang
yr 2010
uq 5
ui 128
ul TG 122; FreeLang.English_Ojibwe.wb
li co
ip Every author exercises rights with respect to the part of a list that represents that person’s own contribution.
co Guy T. Gambill
ad gambillgt1@yahoo.com

The first lines indicate which varieties the dictionary documents. In this case, they are 187 (English, eng0) and 536 (Chippewa, ciw0).

Bidicts

To find out which dictionaries document a particular pair of varieties:

$ panlex bidicts 187 536
128 | Freelang Ojibwe-English dictionary | 13741 | eng-ciw-Weshki
153 | Freelang Ojibwe-English dictionary | 1319 | ciw-ojw-ojc-ojs-ojg-otw-mic-pot-eng-Weshki
611 | Astronomia Terminaro | 2474 | mul-Rapley
2409 | Swadesh Lists | 207 | art-mul-SL
2830 | Ezhi-Giigidaang, How We Say It (Pronunciation) | 0 | ciw-eng-Kimewon
3778 | Ojibwe Vocabulary Project | 0 | ciw-eng-Manidoons
4144 | The Ojibwe People's Dictionary | 0 | eng-ciw-OPD

The columns are: dictionary ID (ap.ap), title (ap.ti), number of entries (the count of mn records where mn.ap equals the dictionary ID), and short code (aped.fp).
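A rough sketch of how such a listing could be computed from the av and mn tables. The rows below are made up, and the function only returns the ID and the entry count; the executable's actual implementation may differ.

```python
from collections import Counter, defaultdict

# Made-up rows in the av (ap, lv) and mn (mn, ap) layouts.
av = [("128", "187"), ("128", "536"), ("4091", "19"), ("4091", "211")]
mn = [("1", "128"), ("2", "128"), ("3", "4091")]

def bidicts(lv1, lv2):
    """Return (dictionary ID, entry count) for dictionaries documenting both varieties."""
    varieties = defaultdict(set)
    for ap, lv in av:
        varieties[ap].add(lv)
    counts = Counter(ap for _mn, ap in mn)   # count where mn.ap == ap
    return [(ap, counts[ap])
            for ap, lvs in sorted(varieties.items())
            if {lv1, lv2} <= lvs]

result = bidicts("187", "536")
```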

Bidict

To extract a bidict:

$ panlex bidict 128 536 187 | uniq > tmp.out

The result is ASCII sorted (case sensitive), in two-column format, with a single tab character as column separator. Let us think of the first column as the target language and the second column as the glossing language. If a target-language word has multiple glosses, they produce multiple lines in the file, all sharing the same target-language word. (Since the file is sorted, they form a contiguous block.) For example, the following occurs in the middle of tmp.out:

aabizh  cut seams open on
aabizhiishin    perk up
aabiziishin     come to
aabiziishin     revive

For some reason, the dictionaries sometimes contain duplicate entries - hence the "uniq" in the command line above.
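Because the output is sorted, the glosses of a target word can be collected with itertools.groupby. The sketch below uses the sample lines above, plus one duplicated line added for illustration, and assumes the tab-separated two-column format just described.

```python
from itertools import groupby

# The sample lines shown above, with one duplicate added for illustration.
lines = [
    "aabizh\tcut seams open on",
    "aabizhiishin\tperk up",
    "aabiziishin\tcome to",
    "aabiziishin\tcome to",
    "aabiziishin\trevive",
]

def group_glosses(sorted_lines):
    """Yield (target word, list of distinct glosses) from sorted two-column lines."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, block in groupby(pairs, key=lambda p: p[0]):
        glosses = []
        for _word, gloss in block:
            if gloss not in glosses:   # plays the role of uniq
                glosses.append(gloss)
        yield word, glosses

grouped = dict(group_glosses(lines))
```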

Panlex module

Zip files

f = open_zipfile()

The Panlex zip file is ~/src/cl/panlex-20140501-csv.zip.

Things you can do with a zip file:

f.namelist()      # list of filenames
f.printdir()      # print long listing
s = f.read(name)  # one of the names from namelist

The entire file is read as a single string.
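These are the methods of the standard zipfile.ZipFile class. The sketch below exercises them on a tiny in-memory archive, so that it runs without the real dump. The archive name and contents are made up. In Python 3, read() returns bytes; decoding as UTF-8 is an assumption about the dump.

```python
import io
import zipfile

# Build a tiny in-memory archive standing in for the Panlex zip file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("panlex-demo-csv/af.csv", "ap,fm\n1636,24\n")
buf.seek(0)

with zipfile.ZipFile(buf) as f:
    names = f.namelist()                       # list of filenames
    raw = f.read("panlex-demo-csv/af.csv")     # whole member as bytes
    text = raw.decode("utf-8")                 # assumed encoding
```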

The list of Panlex files:

>>> from panlex import open_zipfile
>>> f = open_zipfile()
>>> for nm in f.namelist():
...     print(nm)
...
panlex-20140501-csv/
panlex-20140501-csv/af.csv
panlex-20140501-csv/mi.csv
panlex-20140501-csv/aped.csv
panlex-20140501-csv/df.csv
panlex-20140501-csv/wc.csv
panlex-20140501-csv/av.csv
panlex-20140501-csv/lv.csv
panlex-20140501-csv/fm.csv
panlex-20140501-csv/ex.csv
panlex-20140501-csv/dm.csv
panlex-20140501-csv/cp.csv
panlex-20140501-csv/md.csv
panlex-20140501-csv/dn.csv
panlex-20140501-csv/cu.csv
panlex-20140501-csv/ap.csv
panlex-20140501-csv/wcex.csv
panlex-20140501-csv/mn.csv
panlex-20140501-csv/apli.csv

Reading a file

Raw contents.

s = raw_contents(fn)

The fn omits the directory name and the .csv suffix. That is, legitimate values are "af," "mi," etc.

Reader.

r = reader(fn)

Uses csv.reader to parse the csv format. The return value is an iterator over records, each record being a list of fields. The first record contains the field names.

>>> from panlex import reader
>>> r = reader('af')
>>> next(r)
['ap', 'fm']
>>> next(r)
['1636', '24']

Open file.

(hdr, recs) = open_file(fn)

The header is the list of field names, and recs is an iterator over the content records.
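A minimal sketch of the same idea using csv.reader directly on a string. The function name open_text is hypothetical, and the data is the first two records of af shown above.

```python
import csv
import io

def open_text(text):
    """Return (header, iterator over content records) for CSV text."""
    rows = csv.reader(io.StringIO(text))
    header = next(rows)     # the first record holds the field names
    return header, rows

hdr, recs = open_text("ap,fm\n1636,24\n")
first = next(recs)
```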

Print headers. Prints the database schema: the names and headers of all the files.

>>> from panlex import print_headers
>>> print_headers()
af: ap fm
mi: mn tt
aped: ap q cx im re ed fp etc
df: df mn lv tt
wc: wc dn ex
av: ap lv
lv: lv lc vc sy am ex
fm: fm tt md
ex: ex lv tt td
dm: dm mn ex
cp: lv c0 c1
md: md dn vb vl
dn: dn mn ex
cu: lv c0 c1 loc vb
ap: ap dt tt ur bn au ti pb yr uq ui ul li ip co ad
wcex: ex tt
mn: mn ap
apli: id li pl

Head and cat. The function head() prints the first n records. The function cat() dumps the contents readably. cat(fn,'html') produces HTML output.

Database tables

Where. Select records containing specified values in a specified field. The return value is an iterator over records.

>>> from panlex import where
>>> for r in where('lv', 'lc', 'deu'):
...     print('|'.join(r))
...
157|deu|0|t|t|274
1349|deu|1|t|t|18586881
1845|deu|2|t|t|18586883
9097|deu|3|t|t|12660638
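The selection itself amounts to a filter over records. The following sketch takes the header and records explicitly rather than a table name, so its signature differs from that of the module's where(); the name where_sketch is hypothetical. The rows are taken from the listings in this document.

```python
def where_sketch(header, records, field, value):
    """Yield the records whose column `field` equals `value`."""
    i = header.index(field)
    for rec in records:
        if rec[i] == value:
            yield rec

hdr = ["lv", "lc", "vc", "sy", "am", "ex"]
recs = [
    ["30", "ojb", "0", "t", "t", "18592962"],
    ["157", "deu", "0", "t", "t", "274"],
    ["1349", "deu", "1", "t", "t", "18586881"],
]
german = list(where_sketch(hdr, recs, "lc", "deu"))
```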

Expand expressions.

r = expand_expressions(recs, hdr)

Returns an iterator over records. Two new columns are added: the first contains the expression's string, and the second contains the expression's variety.
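A sketch of the expansion, assuming an in-memory lookup table from expression ID to (string, variety). The extra ex_table argument and the name expand_sketch are assumptions made for the example; the module function takes only recs and hdr.

```python
def expand_sketch(records, header, ex_table):
    """Append ex.tt and ex.lv columns to records that have an ex column."""
    i = header.index("ex")
    for rec in records:
        tt, lv = ex_table[rec[i]]
        yield rec + [tt, lv]

hdr = ["lv", "lc", "vc", "sy", "am", "ex"]
recs = [["157", "deu", "0", "t", "t", "274"]]
ex_table = {"274": ("Deutsch", "157")}   # exid -> (string, variety)
expanded = list(expand_sketch(recs, hdr, ex_table))
```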

Extracting dictionaries

Dict entries. The function dict_entry_ids() returns an iterator over the entry IDs (lxids) for a given dictionary or dictionaries.

>>> from panlex import dict_entry_ids
>>> len(list(dict_entry_ids('128')))
13741

The function dict_entry_table() returns a table whose keys are meaning IDs and whose values are lists of pairs of the form (lvid, w), where w is a word string.

>>> from panlex import dict_entry_table
>>> ents = dict_entry_table('128')
>>> len(ents)
13741
>>> mns = list(ents)
>>> mns[0]
'2525999'
>>> ents[mns[0]]
[('187', 'consider'), ('536', 'naagadawaabam')]
>>> ents[mns[1]]
[('187', 'knock against'), ('536', 'bitaakoshkan')]

Bilex pairs. The function bilex_pairs() returns an alphabetically sorted list of word pairs representing the entries of the given dictionary.

>>> from panlex import bilex_pairs
>>> pairs = bilex_pairs('128','536','187')
>>> pairs[0]
['Aabamadong', 'Fort Hope']
>>> len(pairs)
13739

Note that the pair of language IDs is not predictable from the dictionary. The dictionary may contain more than two languages, and even if it only contains two, the dictionary does not specify their order.

The database

Zip file

The database dump is contained in a zip file. The class ZipFile is used to access it.

>>> from seal.data.panlex import ZipFile
>>> zf = ZipFile()

Methods are provided for listing the contents of the zip file.

>>> zf.ls()
File Name                                             Modified             Size
panlex-20140501-csv/                           2014-05-01 03:02:18            0
panlex-20140501-csv/af.csv                     2014-05-01 03:00:04        38522
panlex-20140501-csv/mi.csv                     2014-05-01 03:02:00     33214449
...
>>> list(zf.filenames())
['af', 'mi', 'aped', 'df', 'wc', 'av', 'lv', 'fm', 'ex', ..., 'apli']

The method print_headers() prints out, for each table, its name and field names. It takes a minute or two to run.

>>> zf.print_headers()
af: ap fm
mi: mn tt
aped: ap q cx im re ed fp etc
...

To print the contents of the tables, the methods head and cat are provided.

>>> zf.head('wcex', 3)
ex | tt
3846607 | noun
3846608 | verb
>>> zf.cat('wcex')
ex | tt
3846607 | noun
3846608 | verb
3846609 | adjv
...

The method table returns a Table object containing the contents of the table. If the table contains an ex field, two new fields named ex.tt and ex.lv are added to each record. This method can be slow to run.

Tables

A Table is a collection of records. It has the following members and methods:

header
A list of strings.
records
A list of records, each record being a list of strings.
where(f,v)
Returns a new Table containing the subset of records in which field f has value v.
dump()
Prints out the table.
grep(f,v)
Prints out the subtable for which field f has value v.
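A minimal sketch of a class with these members, to make the interface concrete. It is not the module's implementation; the output format of dump() is modeled on the listings elsewhere in this document.

```python
class Table:
    """A header plus a list of records (lists of strings)."""

    def __init__(self, header, records):
        self.header = list(header)
        self.records = [list(r) for r in records]

    def where(self, f, v):
        """Return a new Table with the records in which field f has value v."""
        i = self.header.index(f)
        return Table(self.header, [r for r in self.records if r[i] == v])

    def dump(self):
        """Print the header and records, one per line."""
        print(" | ".join(self.header))
        for r in self.records:
            print(" | ".join(r))

    def grep(self, f, v):
        """Print the subtable in which field f has value v."""
        self.where(f, v).dump()

t = Table(["ex", "tt"], [["3846607", "noun"], ["3846608", "verb"]])
verbs = t.where("tt", "verb")
```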

Parser

A Parser instance digests the information in the tables.

Compiler

The value of compile is a Compiler instance. It is used to create digested files. If called with no arguments, it creates the files

Utility functions

The function attribute_entries() iterates over the records for a given subject type or a given subject-relation pair. For example:

>>> i = attribute_entries('expression', 'label')
>>> i.next()
(('expression', 'label', 'string'), '3990756', u'!')

The entries are of form (t, v_1, v_2), where t is of form (t_1, r, t_2).

Collect variety languages. The function collect_variety_languages() iterates over the variety-language records, and constructs a table indexed by variety ID (an int), whose value is the variety's language. E.g.:

>>> vlangs = collect_variety_languages()
>>> vlangs[187]
'eng'

Collect approvers. The function collect_approvers() returns a table indexed by approver ID, in which the values are lists of form [lang, variety, quality, title].

Extracting bilexicons. A bilexicon is represented in Python by the class Bilex:

>>> b = Bilex('spa','eng')

Create raw. The first step is to create the raw bilexicon:

>>> b.create_raw()

This takes about 25 minutes to run. The output (in this example) is the file spa-eng-raw.txt in the directory /cl/data/panlex/lex.

The create_raw() method starts by loading the variety-language table, which maps varieties to their languages.

Then it goes through the expression-variety records, creating a table of expressions. The keys are expressions (ints) and the values are lists of form [variety, label, degraded text]. An entry is created only for expressions whose variety's language is one of the two languages of interest. Label and degraded text are initially set to the empty string.

Next it goes through the expression-label and expression-degraded-text records, filling in the other fields of the expression entries.

Next it creates a denotations table. It goes through the denotation-expression records. If the expression has an entry in the expressions table, then a new entry is created in the denotations table. The key is the denotation (an int), and the value is a list of form [expression, part of speech, meaning]. Initially only the expression is set. Part of speech is initialized to the empty string and meaning is initialized to 0.

Next it goes through the denotation-pos records and the denotation-meaning records, filling in the remaining fields in the denotation entries.

By that point, memory is pretty much full. Output is written to lang1-lang2-raw.txt. We pass through the denotations table. Each denotation entry contains an expression ID; we use it to fetch the expression entry. The expression entry contains a variety ID; we use it to look up the language. Each denotation generates one line of output, of form:

m lang v expr degraded pos d e

The single letters represent integer IDs: meaning (m), variety (v), denotation (d), expression (e). The denotation and expression IDs are included only for debugging purposes.

Sort raw. The method sort_raw() calls Unix sort to sort the raw file by meaning, language, variety, and label. The output is written to lang1-lang2-m1.txt. It takes a couple of minutes to run.

Create m2. The method create_m2() adds approvers, and also filters out monolingual meanings. (I tried adding approvers when creating the raw file, but Python runs out of memory.)

>>> b.create_m2()

The method scans through the m1.txt file, collecting a table of meanings. For each block of meanings, note is kept of whether both languages are seen. If so, an entry is created in the meanings table, and otherwise no entry is created. The meanings table is indexed by meaning ID, and the value is the approver ID (initialized to 0).

After creating the meanings table, the method passes through the meaning-approver records and sets the values (approvers) for the meanings.

Next it calls collect_approvers() to get the quality information for each approver.

Finally, it passes a second time through the m1.txt file. Each time it encounters a new meaning, it looks in the meanings table to see whether it should be kept or not. If the meaning is a keeper, the quality of the approver is looked up in the approvers table. Each line from m1.txt that is to be kept is copied to m2.txt, and two new fields are added at the end: approver ID and quality. Hence the lines in m2.txt are of form:

m lang v expr degraded pos d e a q

where "a" is approver and "q" is quality (both are ints).
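The filtering of monolingual meanings can be sketched in a single pass with itertools.groupby, since the m1 file is sorted by meaning. This is a simplification of the two-pass procedure described above and leaves out the approver and quality fields. It assumes tab-separated fields, which is not stated above; the sample lines and the Spanish variety ID are made up.

```python
from itertools import groupby

def keep_bilingual(lines, lang1, lang2):
    """Yield the lines of meanings in which both languages occur.

    Lines have the form 'm lang v expr ...' and are sorted by meaning.
    Assumes tab-separated fields.
    """
    for _m, block in groupby(lines, key=lambda s: s.split("\t")[0]):
        block = list(block)
        langs = {s.split("\t")[1] for s in block}
        if {lang1, lang2} <= langs:
            yield from block

# Made-up sample: meaning 1 is bilingual, meaning 2 is monolingual.
m1 = [
    "1\teng\t187\tdog",
    "1\tspa\t9999\tperro",
    "2\tspa\t9999\tgato",
]
kept = list(keep_bilingual(m1, "spa", "eng"))
```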

Create sources. The method create_sources() extracts detailed information about each of the approvers. It writes the file lang1-lang2-sources.txt. The line format is:

a rel value

where "a" is the approver ID. The relations (attributes) are: lang, variety, regdate, label, creator, isbn, lic_id, license, year, publ, title, and url. An empty line is inserted before each block of records sharing a common value for "a."

By word. The method by_word() creates a file containing lines of form

word-lang1 quality word-lang2

The method sort_by_word() then sorts that file.

It turns out that the quality scores for the approvers are not very informative about whether the entries are actually good. For example, the top quality source (quality 7) for the Spanish word "a" includes meanings "crazy," "missionary," and "physical" - completely bogus. A much better gauge appears to be the number of sources in which the translation occurs.
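Counting sources per translation could be sketched as follows. The triples are made up for illustration, and the function name is hypothetical; no such method is described above.

```python
from collections import defaultdict

# Made-up (word, approver, translation) triples.
triples = [
    ("a", 1, "to"), ("a", 2, "to"), ("a", 3, "to"),
    ("a", 4, "crazy"),
]

def rank_translations(rows):
    """Rank (word, translation) pairs by the number of distinct sources."""
    sources = defaultdict(set)
    for word, approver, translation in rows:
        sources[(word, translation)].add(approver)
    return sorted(((len(s), w, t) for (w, t), s in sources.items()),
                  reverse=True)

ranked = rank_translations(triples)
```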