Database APIs

APIs for different linguistic databases can be accessed with lingtypology.db_apis.

In [1]:
import lingtypology.db_apis

1. General

Lingtypology attempts to provide a unified API for the supported language databases. Therefore, classes in this module share some common attributes and methods. In this section I will describe them and provide examples for Autotyp, Wals and Phoible.

In [2]:
from lingtypology.db_apis import Autotyp, Wals, Phoible

1.1. features_list

You can get the list of available features from the database using this attribute.

In [3]:
Autotyp().features_list[:10]  # Truncated so that it does not take too much space
Out[3]:
['Agreement',
 'Alienability',
 'Alignment',
 'Alignment_case_splits',
 'Alignment_per_language',
 'Clause_linkage',
 'Clause_word_order',
 'Clusivity',
 'GR_per_language',
 'Gender']

Note: Phoible has no features_list attribute because it has no features. However, it has a subsets_list attribute that lists the available subsets of Phoible data.

In [4]:
Phoible().subsets_list
Out[4]:
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']

1.2. get_df and get_json

These two methods access the database and return the data as a pandas.DataFrame or a dict, respectively. Example of usage:

In [5]:
Autotyp('Agreement', 'Clusivity').get_df().head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
(get_by_glot_id) Warning: language by ngiy1239 not found
(get_by_glot_id) Warning: language by east2283 not found
(get_by_glot_id) Warning: language by donn1238 not found
(get_by_glot_id) Warning: language by  not found
(get_by_glot_id) Warning: language by  not found
Out[5]:
   language       LID  VPolyagreement.Presence.v2  VPolyagreement.Presence.v1  InclExclAsPerson.Presence  InclExclAny.Presence  InclExclType    InclExclAsMinAug.Presence
0  Ambulas        6    False                       False                       False                      False                 no i/e          False
1  Abkhazian      7    True                        True                        False                      False                 no i/e          False
2  Achinese       9    True                        False                       False                      True                  plain i/e type  False
3  Western Keres  10   True                        True                        False                      False                 no i/e          False
4  Hokkaido Ainu  12   True                        True                        False                      True                  plain i/e type  False

Note: for Phoible and Autotyp you can use the strip_na parameter (list, default: []) to drop rows that have an empty cell in the given columns. Compare the following.
Without strip_na (empty cells are replaced with '~N/A~'):

In [6]:
Phoible().get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-05-15.)
Out[6]:
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
1 KOREAN (UPSID 423) Korean (37.5, 128.0) kore1280 Eurasia 32 21 11 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2170.html https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
3 KET (UPSID 399) Ket (63.7551, 87.5466) kett1243 Eurasia 25 18 7 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2706.html https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252

With the tones column passed to strip_na:

In [7]:
Phoible().get_df(strip_na=['tones']).head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-05-15.)
Out[7]:
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252
6 Kabardian (SPA 4) Kabardian (43.5082, 43.3918) kaba1278 Eurasia 56 49 7 0 https://archive.org/details/kbd_SPA1979_phon https://phoible.org/languages/kaba1278
8 Georgian (SPA 5) Georgian (41.850396999999994, 43.78613) nucl1302 Eurasia 35 29 6 0 https://archive.org/details/kat_SPA1979_phon https://phoible.org/languages/nucl1302
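
The effect of strip_na in the two outputs above can be pictured with a few lines of plain Python. This is only a sketch of the observed behaviour, not lingtypology's implementation: the helper strip_na_rows and the list-of-dicts layout are assumptions made for illustration, and the rows are abbreviated from the tables above. On a real DataFrame, a comparable filter would be df[df['tones'] != '~N/A~'].

```python
# Sketch only: NOT lingtypology's implementation of strip_na.
NA = '~N/A~'  # placeholder shown in the tables above for empty cells


def strip_na_rows(rows, columns):
    """Keep only the rows that have no NA placeholder in the given columns."""
    return [row for row in rows if all(row[col] != NA for col in columns)]


# Rows abbreviated from the Phoible output above (hypothetical dict layout).
rows = [
    {'contribution_name': 'Korean (SPA 1)', 'tones': '0'},
    {'contribution_name': 'KOREAN (UPSID 423)', 'tones': NA},
    {'contribution_name': 'Ket (SPA 2)', 'tones': '0'},
    {'contribution_name': 'KET (UPSID 399)', 'tones': NA},
]

kept = strip_na_rows(rows, ['tones'])
print([row['contribution_name'] for row in kept])
# → ['Korean (SPA 1)', 'Ket (SPA 2)']
```

With an empty column list (the default of strip_na), the sketch keeps every row, which matches the first output above.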

Note: by default, get_df and get_json print the citation when they are called. To disable this, set show_citation to False.

In [8]:
p = Phoible()
p.show_citation = False
p.get_df(strip_na=['tones']).head()
Out[8]:
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252
6 Kabardian (SPA 4) Kabardian (43.5082, 43.3918) kaba1278 Eurasia 56 49 7 0 https://archive.org/details/kbd_SPA1979_phon https://phoible.org/languages/kaba1278
8 Georgian (SPA 5) Georgian (41.850396999999994, 43.78613) nucl1302 Eurasia 35 29 6 0 https://archive.org/details/kat_SPA1979_phon https://phoible.org/languages/nucl1302

1.3. citation

You can get the citation for each database using the citation attribute. E.g.:

In [9]:
from lingtypology.db_apis import Autotyp
print(Autotyp().citation)
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0

Note: if you use Wals, the citation is shown for every feature. If you want the general citation for the whole of Wals, use general_citation.

In [10]:
w = Wals('1a', '2a')
print(w.citation)
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-05-15.)

Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-05-15.)


In [11]:
print(w.general_citation)
Dryer, Matthew S. & Haspelmath, Martin (eds.) 2013.
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info, Accessed on 2019-05-15.)

2. Wals

It is possible to access Wals data (online) using lingtypology.db_apis.Wals.

In [12]:
from lingtypology.db_apis import Wals
In [13]:
wals_page = Wals('1a', '2a').get_df()
wals_page.head()
Citation for feature 1A:
Ian Maddieson. 2013. Consonant Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/1, Accessed on 2019-05-15.)

Citation for feature 2A:
Ian Maddieson. 2013. Vowel Quality Inventories.
In: Dryer, Matthew S. & Haspelmath, Martin (eds.)
The World Atlas of Language Structures Online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://wals.info/chapter/2, Accessed on 2019-05-15.)

Out[13]:
   wals code  language          genus                family               area       coordinates                      _1A               _2A
0  kiw        Kiwai (Southern)  Kiwaian              Kiwaian              Phonology  (-8.0, 143.5)                    Small             Average (5-6)
1  xoo        !Xóõ              Tu                   Tu                   Phonology  (-24.0, 21.5)                    Large             Average (5-6)
2  ani        //Ani             Khoe-Kwadi           Khoe-Kwadi           Phonology  (-18.9166666667, 21.9166666667)  Large             Average (5-6)
3  abi        Abipón            South Guaicuruan     Guaicuruan           Phonology  (-29.0, -61.0)                   Moderately small  Average (5-6)
4  abk        Abkhaz            Northwest Caucasian  Northwest Caucasian  Phonology  (43.0833333333, 41.0)            Large             Small (2-4)

Map example for feature 1A:

In [14]:
m = lingtypology.LingMap(wals_page.language)
m.add_custom_coordinates(wals_page.coordinates)
m.add_features(wals_page._1A)
m.legend_title = 'Consonant Inventory'
m.create_map()
Out[14]:

2. Autotyp

It is possible to access Autotyp data (online) using lingtypology.db_apis.Autotyp.

Unlike in Wals, each new table name passed into Autotyp gives several additional columns:

In [15]:
Autotyp_table = Autotyp('Gender', 'Agreement').get_df(strip_na=['Gender.binned4'])
Autotyp_table.head()
Bickel, Balthasar, Johanna Nichols, Taras Zakharko,
Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler,
Lennart Bierkandt, Fernando Zúñiga & John B. Lowe.
2017. The AUTOTYP typological databases.
Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0
(get_by_glot_id) Warning: language by ngiy1239 not found
(get_by_glot_id) Warning: language by east2283 not found
Out[15]:
   language              LID   Gender.n  Gender.binned4       Gender.Presence  VPolyagreement.Presence.v2  VPolyagreement.Presence.v1
0  Godoberi              1531  3         3 genders            True             False                       False
1  Bininj Kun-Wok        655   4         4 genders            True             True                        True
2  Luvale                553   10        more than 4 genders  True             True                        False
3  North-Central Dargwa  2949  3         3 genders            True             True                        True
4  Gaagudju              82    4         4 genders            True             True                        True

Now we can draw a map of the gender data for multiple languages.

In [16]:
m = lingtypology.LingMap(Autotyp_table.language)
m.add_features(Autotyp_table['Gender.binned4'])
m.colors = lingtypology.gradient(4, color1='yellow', color2='red')
m.legend_title = 'Genders'
m.create_map()
Out[16]:
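
The call lingtypology.gradient(4, color1='yellow', color2='red') above supplies four colors running from yellow to red. How the library computes them is not shown in this document, so the following is only a sketch of one common approach, linear interpolation in RGB space. The helper hex_gradient, the RGB tuples and the hex output format are assumptions for illustration, not lingtypology's API.

```python
# Sketch only: a generic linear RGB gradient, NOT lingtypology.gradient itself.
def hex_gradient(n, start_rgb, end_rgb):
    """Return n hex colors evenly spaced between start_rgb and end_rgb."""
    if n < 2:
        raise ValueError('a gradient needs at least two colors')
    colors = []
    for i in range(n):
        # Interpolate each channel separately, then round to an integer 0-255.
        channels = [
            round(a + (b - a) * i / (n - 1))
            for a, b in zip(start_rgb, end_rgb)
        ]
        colors.append('#{:02x}{:02x}{:02x}'.format(*channels))
    return colors


YELLOW = (255, 255, 0)  # assumed RGB value for 'yellow'
RED = (255, 0, 0)       # assumed RGB value for 'red'

print(hex_gradient(4, YELLOW, RED))
# → ['#ffff00', '#ffaa00', '#ff5500', '#ff0000']
```

One color per category is the reason the map code asks for exactly four steps: Gender.binned4 is a four-way binning.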

3. AfBo

In [17]:
from lingtypology.db_apis import AfBo
In [18]:
adj = AfBo('adjectivizer').get_df()
adj.head()
Seifart, Frank. 2013.
AfBo: A world-wide survey of affix borrowing.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
(Available online at http://afbo.info, Accessed on 2019-05-15.)
Out[18]:
   language_recipient   language_donor  reliability  adjectivizer
0  Resígaro             Bora            high         0
1  Gurindji Kriol       Gurindji        high         0
2  Copper Island Aleut  Russian         high         0
3  Sakha                Mongolian       high         4
4  Kalderash Romani     Romanian        high         1
In [19]:
m = lingtypology.LingMap(adj.language_recipient)
m.add_features(adj['adjectivizer'], numeric=True)
m.legend_title = 'Adj'
m.create_map()
Out[19]:

4. SAILS

In [20]:
from lingtypology.db_apis import Sails

To get a pandas.DataFrame of features and descriptions:

In [21]:
Sails().features_descriptions.head()
Out[21]:
   Feature  Description
0  ICU17    Is plurality in independent pronouns expressed...
1  ICU16    Is plurality in independent pronouns expressed...
2  ICU15    Is plurality in independent pronouns expressed...
3  ICU14    Is an associative or collective plural disting...
4  ICU13    Are nouns denoting inanimates marked for plural?

To get the descriptions of particular features:

In [22]:
Sails().feature_descriptions('ICU10', 'ICU11')
Out[22]:
   Feature  Description
0  ICU10    Is nominal plural marking obligatory?
1  ICU11    Are nouns denoting humans marked for plural?

To get the SAILS data as a dict, you can use the get_json method. To get the data as a pandas.DataFrame, you can run:

In [23]:
sails = Sails('ICU10', 'ICU11')
df = sails.get_df()
df.head()
You probably should cite it, but I don't understand how. Please, consult https://sails.clld.org/
Out[23]:
   language         coordinates                     ICU10  ICU10_desc  ICU11  ICU11_desc
0  Tol              (14.66859, -87.03719)           0      No          1      Yes
1  San Blas Kuna    (9.15686, -78.3075)             0      No          1      Yes
2  Wayuu            (10.22515, -71.81012)           0      No          1      Yes
3  Northern Emberá  (7.127610000000001, -77.57396)  0      No          1      Yes
4  Páez             (2.61516, -76.31254)            0      No          1      Yes

Map example:

In [24]:
m = lingtypology.LingMap(df.language)
m.add_features(df.ICU10_desc)
m.legend_title = sails.feature_descriptions('ICU10').Description.at[0]
m.start_location = (9, -79)
m.start_zoom = 5
m.create_map()
Out[24]:

5. Phoible

In [25]:
from lingtypology.db_apis import Phoible

Unlike with the other databases, you do not pass features into Phoible; you pass a subset instead. Take a look:

In [26]:
p = Phoible()
p.get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-05-15.)
Out[26]:
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
1 KOREAN (UPSID 423) Korean (37.5, 128.0) kore1280 Eurasia 32 21 11 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2170.html https://phoible.org/languages/kore1280
2 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
3 KET (UPSID 399) Ket (63.7551, 87.5466) kett1243 Eurasia 25 18 7 ~N/A~ http://web.phonetik.uni-frankfurt.de/L/L2706.html https://phoible.org/languages/kett1243
4 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252

Some languages have several entries because Phoible data consists of several different subsets. You can get the list of available subsets:

In [27]:
p.subsets_list
Out[27]:
['all', 'UPSID', 'SPA', 'AA', 'PH', 'GM', 'RA', 'SAPHON']

... and pass them into the class:

In [28]:
p = Phoible(subset='SPA')
df = p.get_df(strip_na=['tones'])
df.head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-05-15.)
Out[28]:
contribution_name language coordinates glottocode macroarea phonemes consonants vowels tones source inventory_page
0 Korean (SPA 1) Korean (37.5, 128.0) kore1280 Eurasia 40 22 18 0 https://archive.org/details/kor_SPA1979_phon https://phoible.org/languages/kore1280
1 Ket (SPA 2) Ket (63.7551, 87.5466) kett1243 Eurasia 32 18 14 0 https://archive.org/details/ket_SPA1979_phon https://phoible.org/languages/kett1243
2 Lak (SPA 3) Lak (42.1328, 47.0809) lakk1252 Eurasia 69 60 9 0 https://archive.org/details/lbe_SPA1979_phon https://phoible.org/languages/lakk1252
3 Kabardian (SPA 4) Kabardian (43.5082, 43.3918) kaba1278 Eurasia 56 49 7 0 https://archive.org/details/kbd_SPA1979_phon https://phoible.org/languages/kaba1278
4 Georgian (SPA 5) Georgian (41.850396999999994, 43.78613) nucl1302 Eurasia 35 29 6 0 https://archive.org/details/kat_SPA1979_phon https://phoible.org/languages/nucl1302
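
To check which languages have more than one inventory in the unfiltered data, you can count rows per glottocode. The sketch below uses plain Python on the first five rows of the unfiltered Phoible output shown earlier; the (name, glottocode) tuple layout is an assumption for illustration. On a real DataFrame, df.glottocode.value_counts() gives comparable counts.

```python
from collections import Counter

# (contribution_name, glottocode) pairs copied from the unfiltered output.
inventories = [
    ('Korean (SPA 1)', 'kore1280'),
    ('KOREAN (UPSID 423)', 'kore1280'),
    ('Ket (SPA 2)', 'kett1243'),
    ('KET (UPSID 399)', 'kett1243'),
    ('Lak (SPA 3)', 'lakk1252'),
]

# Count how many inventories each glottocode has.
counts = Counter(glottocode for _, glottocode in inventories)

# Glottocodes that occur more than once, sorted for a stable order.
repeated = sorted(code for code, n in counts.items() if n > 1)
print(repeated)
# → ['kett1243', 'kore1280']
```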

You can also get non-aggregated data by setting aggregated to False while initializing the class.

In [29]:
Phoible(aggregated=False).get_df().head()
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-05-15.)
Out[29]:
InventoryID Glottocode ISO6393 LanguageName SpecificDialect GlyphID Phoneme Allophones Marginal SegmentClass ... retractedTongueRoot advancedTongueRoot periodicGlottalSource epilaryngealSource spreadGlottis constrictedGlottis fortis raisedLarynxEjective loweredLarynxImplosive click
0 1 kore1280 kor Korean ~N/A~ 0061 a a ~N/A~ vowel ... - - + - - - 0 - - 0
1 1 kore1280 kor Korean ~N/A~ 0061+02D0 ~N/A~ vowel ... - - + - - - 0 - - 0
2 1 kore1280 kor Korean ~N/A~ 00E6 æ ɛ æ ~N/A~ vowel ... - - + - - - 0 - - 0
3 1 kore1280 kor Korean ~N/A~ 00E6+02D0 æː æː ~N/A~ vowel ... - - + - - - 0 - - 0
4 1 kore1280 kor Korean ~N/A~ 0065 e e ~N/A~ vowel ... - - + - - - 0 - - 0

5 rows × 48 columns

Map example:

In [30]:
m = lingtypology.LingMap(df.language)
m.colormap_colors = ('white', 'red')
m.add_features(df.tones, numeric=True)
m.start_zoom = 1
m.legend_title = 'Tones'
m.create_map()
Out[30]:

Another example (slow due to large amount of data):

In [31]:
df = Phoible(subset='UPSID', aggregated=False).get_df()
# Get all languages with ejectives
df = df[df.raisedLarynxEjective == '+']
# Remove duplicates (keep one row per language)
df = df.drop_duplicates(subset='Glottocode')
df.head()
# Note: pandas sometimes emits a SettingWithCopyWarning here. The cause is unclear.
Moran, Steven & McCloy, Daniel (eds.) 2019.
PHOIBLE 2.0.
Jena: Max Planck Institute for the Science of Human History.
(Available online at http://phoible.org, Accessed on 2019-05-15.)
/usr/lib64/python3.7/site-packages/pandas/core/frame.py:4034: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  downcast=downcast, **kwargs)
Out[31]:
InventoryID Glottocode ISO6393 LanguageName SpecificDialect GlyphID Phoneme Allophones Marginal SegmentClass ... retractedTongueRoot advancedTongueRoot periodicGlottalSource epilaryngealSource spreadGlottis constrictedGlottis fortis raisedLarynxEjective loweredLarynxImplosive click
7570 198 afad1236 aal KOTOKO ~N/A~ 0063+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
7802 206 ahte1237 aht AHTNA ~N/A~ 006B+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
7920 211 qawa1238 alc QAWASQAR ~N/A~ 006B+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
8131 218 hame1242 amf HAMER ~N/A~ 0071+02BC ~N/A~ False consonant ... 0 0 - - - + - + - -
8157 219 amha1245 amh AMHARIC ~N/A~ 006B+02B7+02BC kʷʼ ~N/A~ False consonant ... 0 0 - - - + - + - -

5 rows × 48 columns
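
The two steps in the cell above (filter on a feature value, then keep the first row per Glottocode) can be illustrated without network access. This is a sketch with toy rows: the glottocodes come from the outputs in this section, but the dict layout, the helper first_per_key and the second afad1236 row are invented for illustration. By default, drop_duplicates also keeps the first occurrence, which is what the helper mimics.

```python
# Toy rows: one entry per segment, as in the non-aggregated data.
rows = [
    {'Glottocode': 'afad1236', 'raisedLarynxEjective': '+'},
    {'Glottocode': 'afad1236', 'raisedLarynxEjective': '+'},  # invented duplicate
    {'Glottocode': 'kore1280', 'raisedLarynxEjective': '-'},
    {'Glottocode': 'ahte1237', 'raisedLarynxEjective': '+'},
]


def first_per_key(rows, key):
    """Keep the first row seen for each distinct value of rows[i][key]."""
    seen = set()
    unique = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


# Step 1: keep only segments marked as ejective.
ejective = [row for row in rows if row['raisedLarynxEjective'] == '+']
# Step 2: keep one row per language.
unique = first_per_key(ejective, 'Glottocode')

print([row['Glottocode'] for row in unique])
# → ['afad1236', 'ahte1237']
```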

In [32]:
m = lingtypology.LingMap(df.Glottocode, glottocode=True)
m.title = 'Languages with Ejectives'
m.create_map()
Out[32]: