Get started

The google_ngram function supports different varieties of English (e.g., British, American) and allows aggregation by year or decade. The package also supports the analysis of time series data using TimeSeries.

Fetching data

First, import the function:

from google_ngrams import google_ngram

Then we can fetch, for example, frequencies of x-ray by year in American English:

xray_year = google_ngram(word_forms = ["x-ray"], variety = "us", by = "year")

Accessing repository. For larger ones
(e.g., ngrams containing 2 or more words).
This may take a few minutes...
xray_year.head()
shape: (5, 4)
Year  Token        AF   RF
i32   list[str]    i64  f64
1818  ["x - ray"]  3    0.046198
1819  ["x - ray"]  0    0.0
1820  ["x - ray"]  0    0.0
1821  ["x - ray"]  0    0.0
1822  ["x - ray"]  0    0.0

Alternatively, the following would return counts of the combined forms x-ray and x-rays in British English by decade:

xray_decade = google_ngram(word_forms = ["x-ray", "x-rays"], variety = "gb", by = "decade")

Accessing repository. For larger ones
(e.g., ngrams containing 2 or more words).
This may take a few minutes...
xray_decade.head()
shape: (5, 4)
Decade  Token                    AF   RF
i32     list[str]                i64  f64
1710    ["x - ray", "x - rays"]  2    0.159487
1720    ["x - ray", "x - rays"]  0    0.0
1730    ["x - ray", "x - rays"]  0    0.0
1740    ["x - ray", "x - rays"]  0    0.0
1750    ["x - ray", "x - rays"]  0    0.0
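To make the idea of by-decade aggregation concrete, here is a minimal sketch in plain Python of how yearly counts can be rolled up into decade bins. It is not the package's implementation: the helper name aggregate_by_decade and the toy counts are invented for illustration, and it only assumes that a decade is labeled by its first year (as in the Decade column above).

```python
from collections import defaultdict


def aggregate_by_decade(yearly_counts):
    """Sum (year, count) pairs into bins keyed by the first year of each decade.

    Hypothetical helper for illustration; not part of google_ngrams.
    """
    totals = defaultdict(int)
    for year, count in yearly_counts:
        decade = (year // 10) * 10  # e.g. 1818 -> 1810
        totals[decade] += count
    return dict(sorted(totals.items()))


# Toy counts, invented for illustration only (not real ngram data)
toy_counts = [(1818, 3), (1819, 0), (1820, 0), (1821, 1), (1829, 2)]
decade_totals = aggregate_by_decade(toy_counts)
print(decade_totals)
```

Passing by = "decade" to google_ngram returns data already binned this way, so a helper like this is only useful for understanding the shape of the result.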

Analyzing data

To analyze data, import TimeSeries:

from google_ngrams import TimeSeries

To use TimeSeries, provide a polars DataFrame, a column that identifies the time sequence, and a values column that identifies the frequency variable:

xray_ts = TimeSeries(time_series=xray_decade, time_col='Decade', values_col='RF')

We can now generate visualizations like a barplot of frequencies by decade:

xray_ts.timeviz_barplot();

Filter data before VNC clustering

Note that the frequencies in this example are 0 or near 0 until the turn of the twentieth century.

Visualizing VNC clustering can be made clearer by filtering out extended periods with no data. A bar plot like the one above (or a similar scatterplot for by-year data) can then be combined with the VNC plot to describe trajectories of change and periodization.
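Rather than reading the cutoff off the bar plot, one could locate it programmatically. The following is a minimal sketch in plain Python, assuming the series is available as (period, value) pairs; the helper name first_period_above, the threshold, and the toy values are all hypothetical and chosen only for illustration.

```python
def first_period_above(series, threshold):
    """Return the first period whose value exceeds threshold, or None.

    Hypothetical helper for illustration; not part of google_ngrams.
    """
    for period, value in series:
        if value > threshold:
            return period
    return None


# Toy (decade, relative frequency) pairs, invented for illustration only
toy_series = [(1880, 0.0), (1890, 0.4), (1900, 5.2), (1910, 9.8)]
cutoff = first_period_above(toy_series, threshold=1.0)
print(cutoff)
```

The resulting value could then stand in for a hard-coded cutoff year when filtering the DataFrame.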

Filter the data

import polars as pl

xray_filtered = xray_decade.filter(pl.col("Decade") >= 1900)
xray_filtered_ts = TimeSeries(time_series=xray_filtered, time_col='Decade', values_col='RF')
xray_filtered_ts.timeviz_vnc();
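For readers unfamiliar with VNC (variability-based neighbor clustering), the core idea is a hierarchical clustering in which only temporally adjacent periods may merge. The sketch below illustrates that idea in plain Python. It is a simplified stand-in, not the algorithm behind timeviz_vnc: the function name vnc_merge_order, the choice of the pooled standard deviation as the merge criterion, and the toy values are all assumptions made for illustration.

```python
from statistics import pstdev


def vnc_merge_order(labels, values):
    """Repeatedly merge the adjacent pair of clusters with the smallest spread.

    Simplified illustration of neighbor-constrained clustering; the
    package's own implementation may use a different distance measure.
    Returns a list of (left_labels, right_labels, score) tuples in merge order.
    """
    clusters = [([label], [value]) for label, value in zip(labels, values)]
    merges = []
    while len(clusters) > 1:
        # Score each pair of neighbors by the standard deviation of their pooled values
        scores = [
            pstdev(clusters[i][1] + clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        best = min(range(len(scores)), key=scores.__getitem__)
        left, right = clusters[best], clusters[best + 1]
        merges.append((left[0], right[0], scores[best]))
        # Replace the two neighbors with their union, preserving time order
        clusters[best:best + 2] = [(left[0] + right[0], left[1] + right[1])]
    return merges


# Toy decade labels and frequencies, invented for illustration only
toy_labels = [1900, 1910, 1920, 1930]
toy_values = [1.0, 1.2, 5.0, 5.1]
merge_order = vnc_merge_order(toy_labels, toy_values)
for left, right, score in merge_order:
    print(left, "+", right, round(score, 3))
```

Because merges are restricted to neighbors, the resulting hierarchy always respects chronological order, which is what makes it usable for periodization. Long runs of zeros merge first and at near-zero cost, which is one reason filtering them out beforehand gives a more readable plot.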