Getting started#
This section helps you get started with Skchange. It briefly covers the fundamental concepts of the library.
Installation#
pip install skchange
To make full use of the library, you can install the optional Numba dependency. This will speed up the computation of the algorithms in Skchange, often by as much as 10-100 times.
pip install skchange[numba]
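To check whether the optional dependency can be found in the current environment, a minimal standard-library sketch follows. It only tests that a package named numba is importable; it does not verify that Skchange actually uses it.

```python
import importlib.util

# find_spec returns None when the package cannot be found on the import path.
has_numba = importlib.util.find_spec("numba") is not None
print("Numba available:", has_numba)
```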
Change detection#
The task#
Change detection is the task of identifying abrupt changes in the distribution of a time series. The goal is to estimate the time points at which the distribution changes. These points are called change points (or change-points or changepoints).
Here is an example of two changes in the mean of a Gaussian time series with unit variance.
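A series of this kind can be simulated with plain NumPy. The sketch below is illustrative only: the segment means, segment length, and seed are assumptions and do not reproduce the values behind the example above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility

# Illustrative values: three segments of 100 points each, unit variance.
segment_means = [0.0, 5.0, 2.0]
segment_length = 100

series = np.concatenate(
    [rng.normal(loc=m, scale=1.0, size=segment_length) for m in segment_means]
)

# Each change point is the first index of a new segment.
change_points = [segment_length * i for i in range(1, len(segment_means))]
print(change_points)  # [100, 200]
```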
Changes may occur in much more complex ways. For example, changes can affect:
Variance.
Shape of the distribution.
Auto-correlation.
Relationships between variables in multivariate time series.
An unknown, small portion of variables in a high-dimensional time series.
Skchange supports detecting changes in all of these scenarios, amongst others.
Composable change detectors#
Skchange follows a familiar scikit-learn-type API and is compatible with Sktime.
Here’s an example of a change detector:
[1]:
from skchange.change_detectors import MovingWindow
from skchange.change_scores import CUSUM
detector = MovingWindow(
change_score=CUSUM(),
penalty=10,
)
detector
[1]:
MovingWindow(change_score=CUSUM(), penalty=10)
Let us look at each part of the detector in more detail:
change_score: Represents the choice of feature to detect changes in. CUSUM is a popular choice for detecting changes in the mean of a time series.
penalty: Used to control the complexity of the change point model. The higher the penalty, the fewer change points will be detected.
detector: The search algorithm for detecting change points. It governs which data intervals the change score is evaluated on and how the results are compiled to a final set of detected change points.
In Skchange, all detectors follow the same pattern. They are composed of some kind of score to be evaluated on data intervals, and a penalty. You can read more about the core components of Skchange in the Concepts section.
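To make the interplay between a score and a penalty concrete, here is a minimal NumPy sketch of a textbook CUSUM-type score for a single split of a univariate interval. The function name cusum_score and the penalty value are hypothetical, and Skchange's own CUSUM may use a different scaling and aggregates over variables, so treat this as an illustration of the idea rather than the library's implementation.

```python
import numpy as np


def cusum_score(x: np.ndarray, split: int) -> float:
    """Scaled absolute difference in means before and after `split`."""
    left, right = x[:split], x[split:]
    n_left, n_right = len(left), len(right)
    scale = np.sqrt(n_left * n_right / (n_left + n_right))
    return float(scale * abs(left.mean() - right.mean()))


# Noise-free toy interval with a mean shift of 3 at index 4.
x = np.array([0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 3.0])
penalty = 2.0  # illustrative value, not a recommended setting

score = cusum_score(x, split=4)
penalised_score = score - penalty

# A split is only a change point candidate if its score exceeds the penalty.
print(score, penalised_score > 0)
```

Raising the penalty in this sketch makes fewer splits clear the threshold, which mirrors the role of the penalty described above.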
fit#
After initialising your detector of choice, you need to fit it to training data before you can use it to detect change points.
Here is some 3-dimensional Gaussian toy data with four segments, each with a different mean vector.
[2]:
import numpy as np
from skchange.datasets import generate_changing_data
n = 300
cpts = [100, 140, 220]
means = [
np.array([0.0, 0.0, 0.0]),
np.array([8.0, 0.0, 0.0]),
np.array([0.0, 0.0, 0.0]),
np.array([2.0, 3.0, 5.0]),
]
x = generate_changing_data(n, changepoints=cpts, means=means, random_state=8)
x.columns = ["var0", "var1", "var2"]
x.index.name = "time"
x
[2]:
time | var0 | var1 | var2 |
---|---|---|---|
0 | 0.091205 | 1.091283 | -1.946970 |
1 | -1.386350 | -2.296492 | 2.409834 |
2 | 1.727836 | 2.204556 | 0.794828 |
3 | 0.976421 | -1.183427 | 1.916364 |
4 | -1.123327 | -0.664035 | -0.378359 |
... | ... | ... | ... |
295 | 0.325434 | 2.015049 | 4.939516 |
296 | 3.485036 | 3.118221 | 6.393023 |
297 | 2.517864 | 3.445919 | 3.264219 |
298 | 2.290727 | 2.758822 | 4.492490 |
299 | 1.230467 | 1.715009 | 4.918493 |
300 rows × 3 columns
Here is what the data looks like:
[3]:
import plotly.express as px
px.line(x)
As in scikit-learn, the role of fit is to estimate certain parameters of the detector before it can be used for detection tasks on test data. In Skchange, all currently supported detectors have empty fit methods, but this may change in the future.
[4]:
detector.fit(x)
[4]:
MovingWindow(change_score=CUSUM(), penalty=10)
predict#
After fitting the detector, you can use it to detect change points. The predict method returns the integer locations of detected change points.
[5]:
detections = detector.predict(x)
detections
[5]:
 | ilocs |
---|---|
0 | 100 |
1 | 140 |
2 | 220 |
Note that change points indicate the start of a new segment.
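Under this convention, the detected change points translate directly into half-open index ranges, one per segment. The helper below is a hypothetical illustration in plain Python and not part of Skchange.

```python
def segments_from_change_points(change_points, n):
    """Return half-open (start, end) index pairs for each segment."""
    boundaries = [0, *change_points, n]
    return list(zip(boundaries[:-1], boundaries[1:]))


# Change points detected above, for a series of length 300.
segments = segments_from_change_points([100, 140, 220], n=300)
print(segments)  # [(0, 100), (100, 140), (140, 220), (220, 300)]
```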
transform#
Alternatively, you can use the transform method to label the data according to the change point segmentation.
[6]:
labels = detector.transform(x)
labels
[6]:
time | labels |
---|---|
0 | 0 |
1 | 0 |
2 | 0 |
3 | 0 |
4 | 0 |
... | ... |
295 | 3 |
296 | 3 |
297 | 3 |
298 | 3 |
299 | 3 |
300 rows × 1 columns
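The relationship between change points and segment labels can be sketched with NumPy alone. This is an assumption about how such labels can be derived from the detected change points, not a description of how Skchange computes them internally.

```python
import numpy as np

change_points = np.array([100, 140, 220])  # detections from the predict example
n = 300

# side="right" counts the change points <= t, so index 100 already gets label 1.
labels_sketch = np.searchsorted(change_points, np.arange(n), side="right")
print(labels_sketch[[0, 99, 100, 139, 140, 219, 220, 299]])  # [0 0 1 1 2 2 3 3]
```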
[7]:
px.line(labels)
This is useful, for example, for grouping operations per segment:
[8]:
x["label"] = labels
x.groupby("label").agg(["mean", "std"])
[8]:
label | var0 mean | var0 std | var1 mean | var1 std | var2 mean | var2 std |
---|---|---|---|---|---|---|
0 | -0.145056 | 1.038400 | 0.078223 | 1.107580 | 0.016803 | 1.013129 |
1 | 8.085414 | 0.938503 | -0.181219 | 1.152032 | 0.205081 | 0.881243 |
2 | 0.143322 | 1.136743 | 0.126735 | 0.975529 | 0.066954 | 1.085700 |
3 | 2.248388 | 0.919702 | 2.959066 | 1.029075 | 4.851858 | 1.018683 |
transform_scores#
Some detectors also support the transform_scores method, which returns the penalised change scores for each data point. This is the case for MovingWindow.
[9]:
detection_scores = detector.transform_scores(x)
detection_scores
[9]:
time | bandwidth = 20 |
---|---|
0 | NaN |
1 | -6.943667 |
2 | -7.688373 |
3 | -8.703367 |
4 | -6.636503 |
... | ... |
295 | -8.910835 |
296 | -9.046409 |
297 | -8.271702 |
298 | -8.353627 |
299 | -7.787641 |
300 rows × 1 columns
[10]:
import plotly.express as px
px.line(detection_scores)
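The idea behind penalised moving-window scores can be sketched for a univariate series as follows. This is a simplified illustration under stated assumptions: the function moving_window_scores is hypothetical, it uses a textbook mean-shift score, and it leaves positions without a full window on both sides as NaN. Skchange's MovingWindow handles the edges differently (the output above has values close to the start of the series) and also supports multivariate data.

```python
import numpy as np


def moving_window_scores(x: np.ndarray, bandwidth: int, penalty: float) -> np.ndarray:
    """Penalised mean-shift scores for a univariate series.

    Positions where a full window does not fit on both sides are left as NaN.
    """
    n = len(x)
    scores = np.full(n, np.nan)
    scale = np.sqrt(bandwidth / 2.0)  # sqrt(b * b / (b + b)) for equal windows
    for t in range(bandwidth, n - bandwidth + 1):
        before = x[t - bandwidth : t].mean()
        after = x[t : t + bandwidth].mean()
        scores[t] = scale * abs(after - before) - penalty
    return scores


# Noise-free toy series with one mean shift of 4 at index 50.
x = np.concatenate([np.zeros(50), np.full(50, 4.0)])
scores = moving_window_scores(x, bandwidth=20, penalty=10.0)

# The highest penalised score marks the most likely change point location.
best = int(np.nanargmax(scores))
print(best, scores[best] > 0)
```

In this sketch, a positive penalised score means the evidence for a change at that position exceeds the penalty, while scores far from any change stay at minus the penalty.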