sktime.datasets.base

Utilities for loading datasets

sktime.datasets.base.load_UCR_UEA_dataset(name, split=None, return_X_y=False, extract_path=None)

Load a dataset from the UCR/UEA time series classification repository. Downloads and extracts the dataset if it has not already been downloaded.

Parameters
  • name (str) – Name of data set

  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

  • extract_path (str, optional (default=None)) – Path where the dataset is extracted. If None, the default extract path sktime/datasets/data/ is used.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

sktime.datasets.base.load_acsf1(split=None, return_X_y=False)

Loads the power consumption of typical appliances time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: univariate

  • Series length: 1460

  • Train cases: 100

  • Test cases: 100

  • Number of classes: 10

The dataset contains the power consumption of typical appliances. The recordings are characterized by long idle periods and some high bursts of energy consumption when the appliance is active. The classes correspond to 10 categories of home appliances: mobile phones (via chargers), coffee machines, computer stations (including monitor), fridges and freezers, Hi-Fi systems (CD players), lamps (CFL), laptops (via chargers), microwave ovens, printers, and televisions (LCD or LED).

Dataset details: http://www.timeseriesclassification.com/description.php?Dataset=ACSF1
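The dataset description links in this reference share one pattern: a base URL with the dataset name passed as the Dataset query parameter. A minimal sketch of building such a link is shown here; the helper name `description_url` is hypothetical and is not part of sktime, and the base URL is simply copied from the link above.

```python
from urllib.parse import urlencode

# Base URL as it appears in the dataset details links in this reference.
BASE_URL = "http://www.timeseriesclassification.com/description.php"


def description_url(name):
    """Return the description page URL for a dataset name (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"Dataset": name})


print(description_url("ACSF1"))
```

The same helper would apply to the other classification problems on this page by passing their names, for example "ArrowHead" or "GunPoint".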

sktime.datasets.base.load_airline()

Load the airline univariate time series dataset [1].

Returns

  • y (pd.Series) – Time series

Details

The classic Box & Jenkins airline data: monthly totals of international airline passengers, 1949 to 1960.

  • Dimensionality: univariate

  • Series length: 144

  • Frequency: Monthly

  • Number of cases: 1
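The stated series length follows from the date range and frequency. The sketch below checks that arithmetic, assuming the series runs from January 1949 through December 1960 with no gaps; it does not load the data itself.

```python
# Assumption: monthly observations from January 1949 through December 1960.
first_year, last_year = 1949, 1960
months_per_year = 12

# Build a (year, month) label for every observation in the span.
periods = [
    (year, month)
    for year in range(first_year, last_year + 1)
    for month in range(1, months_per_year + 1)
]

print(len(periods))  # 144, matching the documented series length
print(periods[0], periods[-1])
```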

Notes

This data shows an increasing trend, non-constant (increasing) variance and periodic, seasonal patterns.

References

[1] Box, G. E. P., Jenkins, G. M. and Reinsel, G. C. (1976). Time Series Analysis, Forecasting and Control. Third Edition. Holden-Day. Series G.

sktime.datasets.base.load_arrow_head(split=None, return_X_y=False)

Loads the ArrowHead time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: univariate

  • Series length: 251

  • Train cases: 36

  • Test cases: 175

  • Number of classes: 3

The ArrowHead data consists of outlines of the images of arrowheads. The shapes of the projectile points are converted into a time series using the angle-based method. The classification of projectile points is an important topic in anthropology. The classes are based on shape distinctions such as the presence and location of a notch in the arrow. The problem in the repository is a length-normalised version of that used in Ye09shapelets. The three classes are called “Avonlea”, “Clovis” and “Mix”.

Dataset details: http://timeseriesclassification.com/description.php?Dataset=ArrowHead

sktime.datasets.base.load_basic_motions(split=None, return_X_y=False)

Loads the BasicMotions time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: multivariate, 6

  • Series length: 100

  • Train cases: 40

  • Test cases: 40

  • Number of classes: 4

The data was generated as part of a student project where four students performed four activities whilst wearing a smart watch. The watch collects 3D accelerometer and 3D gyroscope data. The dataset consists of four classes: walking, resting, running and badminton. Participants were required to record each motion a total of five times, and the data is sampled once every tenth of a second for a ten-second period.

Dataset details: http://www.timeseriesclassification.com/description.php?Dataset=BasicMotions
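The figures in the details follow from the recording setup described above. The sketch below reproduces that arithmetic; it assumes every student recorded every activity five times, which the description implies but does not state outright.

```python
sampling_interval_s = 0.1  # one sample every tenth of a second
recording_length_s = 10    # ten-second recording

# round() guards against floating-point error in the division.
series_length = round(recording_length_s / sampling_interval_s)

# One channel per axis of the 3D accelerometer and the 3D gyroscope.
accelerometer_axes = 3
gyroscope_axes = 3
n_dimensions = accelerometer_axes + gyroscope_axes

# Assumption: each of the four students recorded each activity five times.
students, activities, repetitions = 4, 4, 5
total_cases = students * activities * repetitions

print(series_length, n_dimensions, total_cases)
```

Under these assumptions the total of 80 cases agrees with the documented 40 train and 40 test cases.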

sktime.datasets.base.load_gunpoint(split=None, return_X_y=False)

Loads the GunPoint time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: univariate

  • Series length: 150

  • Train cases: 50

  • Test cases: 150

  • Number of classes: 2

This dataset involves one female actor and one male actor making a motion with their hand. The two classes are Gun-Draw and Point. For Gun-Draw the actors have their hands by their sides. They draw a replica gun from a hip-mounted holster, point it at a target for approximately one second, then return the gun to the holster, and their hands to their sides. For Point the actors have their gun by their sides. They point with their index fingers to a target for approximately one second, and then return their hands to their sides. For both classes, we tracked the centroid of the actors’ right hands in both X- and Y-axes, which appear to be highly correlated. The data in the archive is just the X-axis.

Dataset details: http://timeseriesclassification.com/description.php?Dataset=GunPoint

sktime.datasets.base.load_italy_power_demand(split=None, return_X_y=False)

Loads the ItalyPowerDemand time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: univariate

  • Series length: 24

  • Train cases: 67

  • Test cases: 1029

  • Number of classes: 2

The data was derived from twelve monthly electrical power demand time series from Italy and first used in the paper “Intelligent Icons: Integrating Lite-Weight Data Mining and Visualization into GUI Operating Systems”. The classification task is to distinguish days from October to March (inclusive) from days from April to September.

Dataset details: http://timeseriesclassification.com/description.php?Dataset=ItalyPowerDemand
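The class definition above is a simple rule on the calendar month. A minimal sketch of that rule follows; the function name and the label strings "Oct-Mar" and "Apr-Sep" are hypothetical and are not the class labels stored in the dataset.

```python
from datetime import date

# October to March (inclusive) form one class; April to September the other.
OCT_TO_MAR = {10, 11, 12, 1, 2, 3}


def season_label(day):
    """Return a hypothetical class name for a calendar day."""
    return "Oct-Mar" if day.month in OCT_TO_MAR else "Apr-Sep"


# Arbitrary example dates, chosen only to exercise both branches.
print(season_label(date(2000, 1, 15)))
print(season_label(date(2000, 7, 15)))
```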

sktime.datasets.base.load_japanese_vowels(split=None, return_X_y=False)

Loads the JapaneseVowels time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: multivariate, 12

  • Series length: 29

  • Train cases: 270

  • Test cases: 370

  • Number of classes: 9

A UCI Archive dataset. Nine Japanese male speakers were recorded saying the vowels ‘a’ and ‘e’. A ‘12-degree linear prediction analysis’ is applied to the raw recordings to obtain time series with 12 dimensions and originally a length between 7 and 29. In this dataset, instances have been padded to the longest length, 29. The classification task is to predict the speaker. Therefore, each instance is a transformed utterance, 12*29 values with a single class label attached, [1…9]. The given training set comprises 30 utterances for each speaker; however, the test set has a varied distribution based on external factors of timing and experimental availability, between 24 and 88 instances per speaker.

Reference: M. Kudo, J. Toyama and M. Shimbo (1999). “Multidimensional Curve Classification Using Passing-Through Regions”. Pattern Recognition Letters, Vol. 20, No. 11–13, pages 1103–1111.

Dataset details: http://timeseriesclassification.com/description.php?Dataset=JapaneseVowels
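The padding step described above can be sketched as follows. This is an illustration only: the description does not say which value was used for padding, so the zero used here is an assumption, and `pad_instance` is a hypothetical helper, not an sktime function.

```python
TARGET_LENGTH = 29  # longest original series length
N_DIMENSIONS = 12   # one series per linear prediction coefficient
PAD_VALUE = 0.0     # assumption: the padding value is not stated in the description


def pad_instance(instance, target_length=TARGET_LENGTH, pad_value=PAD_VALUE):
    """Pad each dimension of an instance (a list of lists) to target_length."""
    padded = []
    for series in instance:
        if len(series) > target_length:
            raise ValueError("series longer than target length")
        padded.append(list(series) + [pad_value] * (target_length - len(series)))
    return padded


# A made-up utterance with 12 dimensions of length 7 (the shortest original length).
utterance = [[1.0] * 7 for _ in range(N_DIMENSIONS)]
padded = pad_instance(utterance)

n_values = sum(len(series) for series in padded)
print(n_values)  # 12 * 29 = 348
```

After padding, every instance holds the same 12*29 values regardless of its original length, which is what allows the cases to be stored in a single rectangular structure.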

sktime.datasets.base.load_longley(y_name='TOTEMP')

Load the Longley multivariate time series dataset for forecasting with exogenous variables.

Parameters

y_name (str, optional (default="TOTEMP")) – Name of target variable (y)

Returns

  • y (pandas.Series) – The target series to be predicted.

  • X (pandas.DataFrame) – The exogenous time series data for the problem.

Details

This dataset contains various US macroeconomic variables from 1947 to 1962 that are known to be highly collinear.

  • Dimensionality: multivariate, 6

  • Series length: 16

  • Frequency: Yearly

  • Number of cases: 1

Variable description

  • TOTEMP - Total employment

  • GNPDEFL - Gross national product deflator

  • GNP - Gross national product

  • UNEMP - Number of unemployed

  • ARMED - Size of armed forces

  • POP - Population

References

[1] Longley, J.W. (1967). “An Appraisal of Least Squares Programs for the Electronic Computer from the Point of View of the User.” Journal of the American Statistical Association. 62.319, 819-41. (https://www.itl.nist.gov/div898/strd/lls/data/LINKS/DATA/Longley.dat)

sktime.datasets.base.load_lynx()

Load the lynx univariate time series dataset for forecasting.

Returns

  • y (pandas Series/DataFrame) – Lynx dataset

Details

The annual numbers of lynx trappings for 1821–1934 in Canada. This time series records the number of skins of predators (lynx) that were collected over several years by the Hudson’s Bay Company. The dataset was taken from Brockwell & Davis (1991) and appears to be the series considered by Campbell & Walker (1977).

  • Dimensionality: univariate

  • Series length: 114

  • Frequency: Yearly

  • Number of cases: 1

Notes

This data shows aperiodic, cyclical patterns, as opposed to periodic, seasonal patterns.

References

[1] Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988). The New S Language. Wadsworth & Brooks/Cole.

[2] Campbell, M. J. and Walker, A. M. (1977). A Survey of statistical work on the Mackenzie River series of annual Canadian lynx trappings for the years 1821–1934 and a new analysis. Journal of the Royal Statistical Society series A, 140, 411–431.

sktime.datasets.base.load_osuleaf(split=None, return_X_y=False)

Loads the OSULeaf time series classification problem and returns X and y.

Parameters
  • split (None or str{"train", "test"}, optional (default=None)) – Whether to load the train or test partition of the problem. By default it loads both.

  • return_X_y (bool, optional (default=False)) – If True, returns (features, target) separately instead of a single dataframe with columns for features and the target.

Returns

  • X (pandas DataFrame with m rows and c columns) – The time series data for the problem with m cases and c dimensions

  • y (numpy array) – The class labels for each case in X

Details

  • Dimensionality: univariate

  • Series length: 427

  • Train cases: 200

  • Test cases: 242

  • Number of classes: 6

The OSULeaf data set consists of one-dimensional outlines of leaves. The series were obtained by color image segmentation and boundary extraction (in the anti-clockwise direction) from digitized leaf images of six classes: Acer Circinatum, Acer Glabrum, Acer Macrophyllum, Acer Negundo, Quercus Garryana and Quercus Kelloggii, for the MSc thesis “Content-Based Image Retrieval: Plant Species Identification” by A. Grandhi.

Dataset details: http://www.timeseriesclassification.com/description.php?Dataset=OSULeaf

sktime.datasets.base.load_shampoo_sales()

Load the shampoo sales univariate time series dataset for forecasting.

Returns

  • y (pandas Series/DataFrame) – Shampoo sales dataset

Details

This dataset describes the monthly number of sales of shampoo over a 3-year period. The units are a sales count.

  • Dimensionality: univariate

  • Series length: 36

  • Frequency: Monthly

  • Number of cases: 1

References

[1] Makridakis, Wheelwright and Hyndman (1998). Forecasting: methods and applications. John Wiley & Sons: New York. Chapter 3.

sktime.datasets.base.load_uschange(y_name='Consumption')

Load the uschange multivariate time series dataset for forecasting growth rates of personal consumption and personal income.

Parameters

y_name (str, optional (default="Consumption")) – Name of target variable (y)

Returns

  • y (pandas Series) – The selected target column (Consumption by default).

  • X (pandas DataFrame) – Columns with explanatory variables.

Details

Percentage changes in quarterly personal consumption expenditure, personal disposable income, production, savings and the unemployment rate for the US, 1960 to 2016.

  • Dimensionality: multivariate

  • Columns: [‘Quarter’, ‘Consumption’, ‘Income’, ‘Production’, ‘Savings’, ‘Unemployment’]

  • Series length: 188

  • Frequency: Quarterly

  • Number of cases: 1

Notes

This data shows an increasing trend, non-constant (increasing) variance and periodic, seasonal patterns.

References

fpp2: Data for “Forecasting: Principles and Practice” (2nd Edition)