binder

Shapelets and the Shapelet Transform with sktime

Introduced in [1], a shapelet is a time series subsequences that is identified as being representative of class membership. Shapelets are a powerful approach for measuring phase-independent similarity between time series; they can occur at any point within a series and offer interpretable results for how matches occur. The original research extracted shapelets to build a decision tree classifier.

The example below illustrates how leaf shape can be represented as a one-dimensional time series (blue line) to distinguish between two species.[2]

5fc8734326ce492e955b3b3831ce2010 65b156db4d984ef9983b3dfb5ae63911

The highlighted red subsection of the time series (i.e., “subsequences”) above is the shapelet that distinguishes Verbena urticifolia from Urtica dioica.

The Shapelet Transform

Much research emphasis has been placed on shapelet-based approaches for time series classification (TSC) since the original research was proposed. The current state-of-the-art for shapelets is the shapelet transform (ST) [3, 4]. The transform improves upon the original use of shapelets by separating shapelet extraction from the classification algorithm, allowing interpretable phase-independent classification of time series with any standard classification algorithm (such as random/rotation forest, neural networks, nearest neighbour classifications, ensembles of all, etc.). To facilitate this, rather than recursively assessing data for the best shapelet, the transform evaluates candidate shapelets in a single procedure to rank them based on information gain. Then, given a set of k shapelets, a time series can be transformed into k features by calculating the distance from the series to each shapelet. By transforming a dataset in this manner any vector-based classification algorithm can be applied to a shapelet-transformed time series problem while the interpretability of shapelets is maintained through the ranked list of the best shapelets during transformation.

Shapelets can provide interpretable results, as seen in the figure below:

15c939443c574193bfcaf607bd398fb9

The shapelet has “discovered” where the two plant species distinctly differ. Urtica dioica has a stem that connects to the leaf at almost 90 degrees, whereas the stem of Verbena urticifolia connects to the leaf at a wider angle.

Having found shapelet, its distance to the nearest matching subsequence in all objects in the database can be recorded. Finally, a simple decision tree classifier can be built to determine whether an object \(Q\) has a subsequence within a certain distance from shapelet \(I\).

8ccae6eac6a7441db2a02b1cf8507737

References

[1] Ye, Lexiang, and Eamonn Keogh. “Time series shapelets: a novel technique that allows accurate, interpretable and fast classification.” Data mining and knowledge discovery 22, no. 1-2 (2011): 149-182.

[2] Ye, Lexiang, and Eamonn Keogh. “Time series shapelets: a new primitive for data mining.” In Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 947-956. 2009.

[3] Lines, Jason, Luke M. Davis, Jon Hills, and Anthony Bagnall. “A shapelet transform for time series classification.” In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 289-297. ACM, 2012.

[4] Hills, Jon, Jason Lines, Edgaras Baranauskas, James Mapp, and Anthony Bagnall. “Classification of time series by shapelet transformation.” Data Mining and Knowledge Discovery 28, no. 4 (2014): 851-881.

[5] Bostrom, Aaron, and Anthony Bagnall. “Binary shapelet transform for multiclass time series classification.” In Transactions on Large-Scale Data-and Knowledge-Centered Systems XXXII, pp. 24-46. Springer, Berlin, Heidelberg, 2017.

Example: The Shapelet Transform in sktime

The following workbook demonstrates a full workflow of using the shapelet transform in sktime with a scikit-learn classifier with the OSU Leaf dataset, which consists of one dimensional outlines of six leaf classes: Acer Circinatum, Acer Glabrum, Acer Macrophyllum, Acer Negundo, Quercus Garryana and Quercus Kelloggii.

[1]:
from sktime.datasets import load_osuleaf
from sktime.transformations.panel.shapelets import ContractedShapeletTransform

train_x, train_y = load_osuleaf(split="train", return_X_y=True)
test_x, test_y = load_osuleaf(split="test", return_X_y=True)
[2]:
# How long (in minutes) to extract shapelets for.
# This is a simple lower-bound initially;
# once time is up, no further shapelets will be assessed
time_contract_in_mins = 1

# The initial number of shapelet candidates to assess per training series.
# If all series are visited and time remains on the contract then another
# pass of the data will occur
initial_num_shapelets_per_case = 10

# Whether or not to print on-going information about shapelet extraction.
# Useful for demo/debugging
verbose = 2

st = ContractedShapeletTransform(
    time_contract_in_mins=time_contract_in_mins,
    num_candidates_to_sample_per_case=initial_num_shapelets_per_case,
    verbose=verbose,
)
st.fit(train_x, train_y)
visiting series: 51 (#1)
2.5714950561523438
Candidate finished. 00:57 remaining
Candidate rejected. 00:-4 remaining
2.962433099746704
Candidate rejected. 00:-6 remaining
3.0671868324279785
Candidate finished. 00:48 remaining
Candidate finished. 00:46 remaining
Candidate finished. 00:43 remaining
3.130671262741089
Candidate finished. 00:40 remaining
Candidate finished. 00:38 remaining
Candidate finished. 00:36 remaining
Candidate finished. 00:33 remaining
visiting series: 55 (#2)
Candidate finished. 00:31 remaining
3.400134801864624
Candidate finished. 00:27 remaining
Candidate finished. 00:25 remaining
Candidate finished. 00:24 remaining
Candidate rejected. 00:-36 remaining
Candidate finished. 00:19 remaining
Candidate finished. 00:17 remaining
Candidate finished. 00:15 remaining
Candidate finished. 00:13 remaining
Candidate finished. 00:11 remaining
visiting series: 4 (#3)
Candidate finished. 00:08 remaining
Candidate finished. 00:06 remaining
Candidate finished. 00:03 remaining
Candidate finished. 00:00 remaining
No more time available! It's been 01:01
Stopping search
[2]:
ContractedShapeletTransform(num_candidates_to_sample_per_case=10,
                            time_contract_in_mins=1, verbose=2)
[3]:
%matplotlib inline
import matplotlib.pyplot as plt

# for each extracted shapelet (in descending order of quality/information gain)
for s in st.shapelets[0:5]:

    # summary info about the shapelet
    print(s)

    # plot the series that the shapelet was extracted from
    plt.plot(train_x.iloc[s.series_id, 0], "gray")

    # overlay the shapelet onto the full series
    plt.plot(
        list(range(s.start_pos, (s.start_pos + s.length))),
        train_x.iloc[s.series_id, 0][s.start_pos : s.start_pos + s.length],
        "r",
        linewidth=3.0,
    )
    plt.show()
Series ID: 51, start_pos: 109, length: 68, info_gain: 0.14713002000707054,
../_images/examples_shapelet_transform_3_1.png
Series ID: 55, start_pos: 75, length: 314, info_gain: 0.14402946691987872,
../_images/examples_shapelet_transform_3_3.png
Series ID: 51, start_pos: 299, length: 92, info_gain: 0.11488133254372052,
../_images/examples_shapelet_transform_3_5.png
Series ID: 51, start_pos: 30, length: 46, info_gain: 0.09575722605459835,
../_images/examples_shapelet_transform_3_7.png
Series ID: 4, start_pos: 311, length: 94, info_gain: 0.09097204859324504,
../_images/examples_shapelet_transform_3_9.png
[4]:
# for each extracted shapelet (in descending order of quality/information gain)
for i in range(0, min(len(st.shapelets), 5)):
    s = st.shapelets[i]
    # summary info about the shapelet
    print("#" + str(i) + ": " + str(s))

    # overlay shapelets
    plt.plot(
        list(range(s.start_pos, (s.start_pos + s.length))),
        train_x.iloc[s.series_id, 0][s.start_pos : s.start_pos + s.length],
    )

plt.show()
#0: Series ID: 51, start_pos: 109, length: 68, info_gain: 0.14713002000707054,
#1: Series ID: 55, start_pos: 75, length: 314, info_gain: 0.14402946691987872,
#2: Series ID: 51, start_pos: 299, length: 92, info_gain: 0.11488133254372052,
#3: Series ID: 51, start_pos: 30, length: 46, info_gain: 0.09575722605459835,
#4: Series ID: 4, start_pos: 311, length: 94, info_gain: 0.09097204859324504,
../_images/examples_shapelet_transform_4_1.png
[5]:
import time

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from sktime.datasets import load_osuleaf

train_x, train_y = load_osuleaf(split="train", return_X_y=True)
test_x, test_y = load_osuleaf(split="test", return_X_y=True)

# example pipeline with 1 minute time limit
pipeline = Pipeline(
    [
        (
            "st",
            ContractedShapeletTransform(
                time_contract_in_mins=time_contract_in_mins,
                num_candidates_to_sample_per_case=10,
                verbose=False,
            ),
        ),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ]
)

start = time.time()
pipeline.fit(train_x, train_y)
end_build = time.time()
preds = pipeline.predict(test_x)
end_test = time.time()

print("Results:")
print("Correct:")
correct = sum(preds == test_y)
print("\t" + str(correct) + "/" + str(len(test_y)))
print("\t" + str(correct / len(test_y)))
print("\nTiming:")
print("\tTo build:   " + str(end_build - start) + " secs")
print("\tTo predict: " + str(end_test - end_build) + " secs")
2.0774919986724854
3.094931125640869
3.2343239784240723
Results:
Correct:
        105/242
        0.43388429752066116

Timing:
        To build:   65.61358594894409 secs
        To predict: 6.453415870666504 secs

Generated by nbsphinx. The Jupyter notebook can be found here.