Get started
The google_ngram function supports different varieties of English (e.g., British, American) and allows aggregation by year or decade. The package also supports the analysis of time series data using TimeSeries.
Fetching data
First, we will import the function:
from google_ngrams import google_ngram
Then, we can fetch, for example, x-ray by year in American English:
xray_year = google_ngram(word_forms = ["x-ray"], variety = "us", by = "year")
Accessing repository. For larger ones
(e.g., ngrams containing 2 or more words).
This may take a few minutes...
xray_year.head()
Year | Token | AF | RF |
---|---|---|---|
i32 | list[str] | i64 | f64 |
1818 | ["x - ray"] | 3 | 0.046198 |
1819 | ["x - ray"] | 0 | 0.0 |
1820 | ["x - ray"] | 0 | 0.0 |
1821 | ["x - ray"] | 0 | 0.0 |
1822 | ["x - ray"] | 0 | 0.0 |
Alternatively, the following would return counts of the combined forms x-ray and x-rays in British English by decade:
xray_decade = google_ngram(word_forms = ["x-ray", "x-rays"], variety = "gb", by = "decade")
Accessing repository. For larger ones
(e.g., ngrams containing 2 or more words).
This may take a few minutes...
xray_decade.head()
Decade | Token | AF | RF |
---|---|---|---|
i32 | list[str] | i64 | f64 |
1710 | ["x - ray", "x - rays"] | 2 | 0.159487 |
1720 | ["x - ray", "x - rays"] | 0 | 0.0 |
1730 | ["x - ray", "x - rays"] | 0 | 0.0 |
1740 | ["x - ray", "x - rays"] | 0 | 0.0 |
1750 | ["x - ray", "x - rays"] | 0 | 0.0 |
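To make the by = "decade" option more concrete, here is a minimal sketch of how yearly counts could be rolled up into decades. It is not the package's own implementation: the helper name to_decades, the (year, count, total_tokens) row layout, the per-million scaling, and the toy numbers are all assumptions made for illustration.

```python
from collections import defaultdict


def to_decades(rows, per=1_000_000):
    """Sum yearly counts into decades (hypothetical helper, not package API).

    rows: iterable of (year, count, total_tokens) tuples.
    Returns a sorted list of (decade, count, relative_frequency) tuples.
    """
    counts = defaultdict(int)
    totals = defaultdict(int)
    for year, count, total in rows:
        decade = year // 10 * 10  # e.g. 1897 -> 1890
        counts[decade] += count
        totals[decade] += total
    # Relative frequency is recomputed from the summed counts and totals
    # rather than averaged across years, so larger years weigh more.
    return [(d, counts[d], counts[d] * per / totals[d]) for d in sorted(counts)]


# Toy numbers, invented for illustration only.
toy = [(1895, 0, 2_000_000), (1896, 4, 2_000_000), (1901, 30, 3_000_000)]
print(to_decades(toy))
```

Recomputing the relative frequency from pooled totals, instead of averaging yearly values, is one reasonable design choice; whether google_ngram does the same is not stated here.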
Analyzing data
To analyze data, import TimeSeries:
from google_ngrams import TimeSeries
To use TimeSeries, provide a polars DataFrame, a column that identifies the time sequence, and a values column that identifies the frequency variable:
xray_ts = TimeSeries(time_series=xray_decade, time_col='Decade', values_col='RF')
We can now generate visualizations like a barplot of frequencies by decade:
xray_ts.timeviz_barplot();
Note that the frequencies in this example are 0 or near 0 until the turn of the twentieth century.
Visualizations of VNC clustering are clearer when extended periods with no data are filtered out. Plots like this bar plot (or a similar scatterplot for by-year data) can therefore be combined with VNC plots to describe trajectories of change and periodization.
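For readers unfamiliar with VNC (variability-based neighbor clustering), the following is a minimal sketch of the general idea: only neighboring time periods may be merged, and at each step the pair whose pooled values vary least is joined. It is not the code behind timeviz_vnc. The function name vnc_merge_order, the use of the population standard deviation as the distance measure, and the toy values are all assumptions.

```python
from statistics import pstdev


def vnc_merge_order(values):
    """Return the sequence of merges of adjacent clusters (illustrative only).

    Each cluster is a list of consecutive positions in `values`. At every
    step, the neighboring pair whose pooled values have the smallest
    standard deviation is merged into one cluster.
    """
    clusters = [[i] for i in range(len(values))]
    merges = []
    while len(clusters) > 1:
        # Score each adjacent pair by the spread of its pooled values.
        scores = [
            pstdev([values[i] for i in clusters[k] + clusters[k + 1]])
            for k in range(len(clusters) - 1)
        ]
        k = scores.index(min(scores))
        merges.append((clusters[k], clusters[k + 1], scores[k]))
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return merges


# Toy relative frequencies for five consecutive periods (invented).
toy_values = [1.0, 1.2, 5.0, 5.1, 9.0]
for left, right, score in vnc_merge_order(toy_values):
    print(left, "+", right, "spread:", round(score, 3))
```

Because only neighbors can merge, the resulting groups are always contiguous in time, which is what makes the method useful for periodization. Long runs of zeros merge first at a spread of 0, which is one reason filtering them out gives a more readable plot.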
Filter the data
import polars as pl
xray_filtered = xray_decade.filter(pl.col("Decade") >= 1900)
xray_filtered_ts = TimeSeries(time_series=xray_filtered, time_col='Decade', values_col='RF')
xray_filtered_ts.timeviz_vnc();
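The cutoff of 1900 in the filter above is written as a literal. As a hedged alternative, the cutoff could be derived from the data by finding the point after which every value stays above some threshold, which also ignores isolated early blips. The helper sustained_start, its threshold parameter, and the toy series below are hypothetical and not part of google_ngrams.

```python
def sustained_start(series, threshold=0.0):
    """Return the earliest time from which all later values exceed threshold.

    series: list of (time, value) pairs sorted by time.
    Returns None if the final value does not exceed the threshold.
    """
    start = None
    # Walk backwards from the most recent period until the run is broken.
    for time, value in reversed(series):
        if value <= threshold:
            break
        start = time
    return start


# Toy (decade, relative frequency) pairs, invented for illustration.
toy_series = [(1710, 0.16), (1720, 0.0), (1890, 0.0), (1900, 2.5), (1910, 7.1)]
cutoff = sustained_start(toy_series)
print(cutoff)
```

The returned value could then stand in for the literal in pl.col("Decade") >= 1900. A threshold above zero may be preferable when early decades contain small nonzero values.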