Ethnologue language listing
Overview
The database in seal.data.langdb is compiled by merging data from
the Ethnologue, from the Library of Congress's official
ISO 639-2 database, and from Panlex.
It uses the iso-639-2 and iso-639-3 packages.
The database is called languages:
>>> from seal.data.langdb import languages
The information in languages exactly reflects the published
databases, with the following exceptions:
- In the published databases, retired codes had no entry for scope
or type, with the exception of one retired code, which had
scope-type of IL (living language). For the sake of
uniformity, I have assigned all retired codes scope 'R' and
type 'R'.
- In the published databases, the names field was filled
only if the language had multiple names, in which case names
included the reference name. For the sake of uniformity, names
now always includes the reference name, and may be a singleton list
containing only the reference name. Inverted names have been
treated similarly.
Language codes
Code sets
The standard three-letter language codes are ISO 639-3 codes. There
are several other code sets in the ISO 639 family.
- ISO 639-1: These are the standard two-letter language codes.
Only 184 languages have a 639-1 code.
- ISO 639-2: These were created for librarians. 418 languages
have a 639-2 code. 20 languages have two different 639-2 codes:
a "bibliographic" code (639-2/B) and a "terminological" code (639-2/T).
The Library of Congress is the registration authority.
- ISO 639-2/B: The bibliographic version of 639-2 codes.
These do not always agree with 639-3.
- ISO 639-2/T: The terminological version of 639-2 codes. These
constitute a subset of 639-3.
- ISO 639-3: The standard three-letter language codes. SIL is the
registration authority. These extend the 639-2/T codes to 8121 languages.
- ISO 639-5: An extension to 639-2 to cover language groupings.
The Library of Congress is the registration authority.
Access by code
The database can be accessed by ISO-639-3 code to get a language:
>>> print(languages['spa'])
Code: spa
Code2B: spa
Code2T: spa
Code1: es
Type: Living
Scope: Language
RefName: Spanish
Name: Spanish
Varieties:
Dicts:
The four codes listed are 639-3, 639-2/B, 639-2/T, and 639-1, in that order.
Language instances
Although one accesses languages as a table, one iterates over it
as a list of languages.
>>> len(languages)
8282
>>> sum(1 for lang in languages if lang.code2b != lang.code2t)
20
A language instance has the following members:
- code
- The 639-3 language code (a string).
- code2b
- The 639-2/B language code, or None.
- code2t
- The 639-2/T language code, or None.
- code1
- The 639-1 language code, or None.
- scope
- The value is 'I' for individual language,
'M' for macrolanguage, 'S' for special code, and 'R'
for retired codes.
The special codes are used when one needs a code for something that is
not actually a language. They are 'mis' for an uncoded language, 'mul'
when the thing to be coded contains multiple different languages, 'und'
when the language is undetermined, and 'zxx' when the thing to
be coded does not actually have linguistic content.
- type
- The value is 'A' for an ancient language, 'C'
for a constructed language, 'E' for an extinct language, 'H'
for an historical language, 'L' for a living language, 'S'
for a special code, and 'R' for retired codes.
- name
- The reference name for the language.
- names
- All names for the language, including the reference name.
- inv{\underscore
n- ames} Inverted names (like 'English, Old').
- comment
- Comments.
- parent
- The macrolanguage that this language belongs to,
if any.
- members
- The member languages, if this is a
macrolanguage.
- retirement
- None unless this is a retired code. If this
is a retired code, the value is an object with the following members:
code repeats the language code, name repeats the name,
reason is the retirement reason, date is the retirement
date (a string), replacement is the new code this one was
replaced with (if any), and split is an English string indicating
which codes this one was split into (if any). The retirement reasons
are: 'C' for a code change, 'D' for deletion of a
duplicate code, 'M' for the merger of multiple codes into a new
code, 'S' for the splitting of one code into multiple codes, and
'N' for deleting of a code that represents a non-existent
language. There is a value for replacement for the 'C',
'D', and 'M' cases, and a a value for split for the
'S' case.
- varieties
- The varieties of this language, as identified by
Panlex. For details about varieties, see the chapter on Panlex.
Search
Normalization
The methods named(), find() and search()
permit one to search for languages by name.
All three methods normalize both the language names and the search
key, as follows:
-
Letters are normalized to lower case. I.e., search is
case-insensitive:
>>> languages.named('SPANISH')
[<Living Language spa 'Spanish'>]
-
Anything in parentheses is ignored.
>>> languages.named('wali')
[<Living Language wll 'Wali (Sudan)'>, <Living Language wlx 'Wali (Ghana)'>]
-
Hyphens are treated like spaces:
>>> languages.named('karkar yuri')
[<Living Language yuj 'Karkar-Yuri'>]
>>> languages.named('old-english')
[<Historical Language ang 'Old English (ca. 450-1100)'>]
-
Accents are removed:
>>> languages.named('yari')
[<Retired Code Retired Code yri 'Yar\xed'>]
Here the Unicode character U+ED represents í (letter i
with an acute accent).
-
Everything that is not a letter is ignored in comparison:
>>> languages.named('p!ao?')
[<Living Language blk "Pa'o Karen">, <Retired Code Retired Code ppa 'Pao'>]
By name
One can access languages by complete name. Since names are sometimes
ambiguous, this returns a list:
>>> languages.named('spanish')
[<Living Language spa 'Spanish'>]
>>> languages.named('pao')
[<Living Language blk "Pa'o Karen">, <Retired Code Retired Code ppa 'Pao'>]
Note that the key need not be the
reference name: "Pa'O" is one of the alternate names for language
blk:
>>> languages['blk'].names
["Pa'O", "Pa'o Karen"]
>>> languages['blk'].name
"Pa'o Karen"
By name part
The method named() does not find a language if one provides only
part of the name:
>>> languages.named('chin')
>>> languages.named('matu chin')
[<Living Language hlt 'Matu Chin'>]
To find a language if one knows only part of the name, used the method
find():
>>> len(languages.find('chin'))
33
By character sequence
The method find() looks for complete words in the name.
(Remember that hyphen is treated as a word separator.) To find a
language given only a part of a word, use search():
>>> languages.search('ruman')
[<Living Language rup 'Macedo-Romanian'>]