Contents:¶

  • Context and Data types
  • Context Generation on dummy data
    • Creating a Context Generation Object
  • Causality Discovery
  • Database Connections
  • Distance Function
  • Clustering Context
  • Pipelines
  • Simulators
    • Simulator stream
    • Simulator using pandas DataFrame
  • Interpretation
  • Limitations

The Documentation can be found here¶

Getting Started with Context¶

In [1]:
from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.structure import Context

Context and Data types ¶

Context is used here to provide a better understanding of the different situations in which the data are observed each time.

Essentially, Context represents the data (CD) existing in a time window, together with their relationships (CR), where the relationships are extracted using causal discovery between the data (the causal discovery method can be user-defined).

PdmContext.utils.structure.Context is used to represent such a context.
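
Conceptually, a Context object bundles these two parts. The following is only an illustrative sketch (not the actual class definition, whose attribute names may differ):

from dataclasses import dataclass, field

# Illustrative sketch only -- the real PdmContext.utils.structure.Context class
# may expose CD and CR under different attribute names.
@dataclass
class ContextSketch:
    CD: dict = field(default_factory=dict)   # series name -> values inside the time window
    CR: list = field(default_factory=list)   # directed edges (cause, effect) from causal discovery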

Data Types¶

Continuous (analog, real, Univariate series ...):¶

So far, CD contains data from different sources and supports different signal sample rates, as well as discrete event data. Differences in sample rate are handled internally in the context generation process, where all series are mapped to the sample rate of a single series called the target series (also referred to as such in the code and documentation); a rough sketch of this mapping follows the list below:

  1. For series with a sample rate higher than that of the target, the samples between two consecutive timestamps of the target series are aggregated (mean).
  2. For series with a lower sample rate, their values are repeated.
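
As a rough illustration of this alignment (this is just a sketch of the idea using pandas, not the library's internal code):

import pandas as pd

# Hypothetical example series, only to illustrate the idea of rate alignment.
target = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("2023-01-01", periods=3, freq="10min"))
fast = pd.Series(range(30), index=pd.date_range("2023-01-01", periods=30, freq="1min"))
slow = pd.Series([5.0, 6.0], index=pd.date_range("2023-01-01", periods=2, freq="20min"))

# Higher sample rate than the target: aggregate (mean) the samples that fall
# between two consecutive target timestamps.
fast_aligned = fast.resample("10min").mean().reindex(target.index)

# Lower sample rate than the target: repeat the last known value until a new one arrives.
slow_aligned = slow.reindex(target.index, method="ffill")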

Event Data:¶

The Context also supports data which are not numeric but are related to some kind of event (events that occur in time); these are often referred to as discrete data. To this end, the Context supports two types of such events:

  1. isolated: Events that have an instant impact when they occur.
  2. configuration: Events that refer to a configuration change that has an impact after its occurrence.

The type of an event is used to transform it into a continuous space and add it to CD.
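
A plausible sketch of such a mapping (the library's actual transformation may differ): an isolated event becomes a pulse at its occurrence, while a configuration event becomes a step that persists after its occurrence.

# Sketch only, not the library's code: map event occurrences onto the target timestamps.
# "isolated"      -> 1 at the timestamp of the occurrence, 0 elsewhere (instant impact)
# "configuration" -> 0 before the occurrence, 1 from the occurrence onward (lasting impact)
def events_to_series(occurrences, target_timestamps, event_type):
    series = []
    for t in target_timestamps:
        if event_type == "isolated":
            series.append(1.0 if t in occurrences else 0.0)
        else:  # "configuration"
            series.append(1.0 if any(o <= t for o in occurrences) else 0.0)
    return series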


Generating some dummy data to test ContextGeneration ¶

In [2]:
import pandas as pd
from random import random

# Create artificial timestamps
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(17)]

# Create a real value time series
data1 = [random() for i in range(17)]

# Create a series of anomaly scores
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]

# Create a configuration Event series
# (for example, in the series below there is a configuration event at the timestamp with index 8)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# Create an isolated Event time series (with occurrences at the timestamps with index 1 and 13)
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1

Creating a Context Generation object ¶

Provide the name of the target series, the time window length using the context_horizon parameter, and the Causality function to calculate CR (we leave this for later).

In [3]:
from PdmContext.utils.causal_discovery_functions import calculate_with_pc


con_gen = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculate_with_pc, debug=False)

Iteratively, we pass the data to the Context Generator by calling its collect_data() method. Each time we pass a single data sample or event from a single source. This method returns a Context object when we pass data with the name of the specified target.

In [4]:
source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    con_gen.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = con_gen.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)

We can plot the last Context object that was returned (the one corresponding to the last timestamp). The first plot shows the CD part of the Context and the second plot shows the CR (in the form of a graph).

In [5]:
contextTemp.plot()
1) config@press : 2023-01-01 00:08:00

Moreover, we can plot all the contexts of the Context Generator (at a more abstract level) using its plot method.

In [6]:
con_gen.plot()

The values on this plot refer to the target value used to build each context sample, and the colors correspond to the relationships that exist in the CR of each Context object. In this example, we can see that the anomaly1 series increases due to the config event (as seen from the CR).

Causality Discovery ¶

The user can implement and provide their own causal discovery method to the Context Generator.

To do this, one simply needs to implement a Python function that takes as parameters:

  1. A list with names of time series data
  2. The time series data in the form of a 2D array

Example: PdmContext.utils.causal_discovery_functions.calculate_with_pc

In [7]:
import networkx as nx
# pip install gcastle
from castle.algorithms import PC
def calculatewithPc(names, data):
    try:
        pc = PC(variant='parallel')
        pc.learn(data)
    except Exception as e:
        print(e)
        return None

    learned_graph = nx.DiGraph(pc.causal_matrix)
    # Relabel the nodes
    MAPPING = {k: n for k, n in zip(range(len(names)), names)}
    learned_graph = nx.relabel_nodes(learned_graph, MAPPING, copy=True)
    edges=learned_graph.edges
    return edges

Database Connections ¶

The current implementation supports connections to two databases (SQLite3 and InfluxDB) through PdmContext.utils.dbconnector.SQLiteHandler and PdmContext.utils.dbconnector.InfluxDBHandler.

Using SQLite will create a database in the location of the main file.

Using InfluxDB requires starting the InfluxDB service first (for example, on Linux: sudo service influxdb start).

Both database connections can be used with the implemented pipeline PdmContext.Pipelines.ContextAndDatabase.

Let's generate the same example as before, but this time store it in the database.

In [8]:
from PdmContext.utils.dbconnector import SQLiteHandler
from PdmContext.Pipelines import ContextAndDatabase


con_gen = ContextGenerator(target="anomaly1", context_horizon="8", Causalityfunct=calculatewithPc, debug=False)
database = SQLiteHandler(db_name="ContextDatabase.db")
contextpipeline = ContextAndDatabase(context_generator_object=con_gen, databaseStore_object=database)

configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
data1 = [random() for i in range(len(configur))]
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(len(data1))]
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]

spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1

source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
contextpipeline.Contexter.plot()

Now we can access the Context objects stored in the database:

In [9]:
database = SQLiteHandler(db_name="ContextDatabase.db")
target_name = "anomaly1"
contextlist = database.get_all_context_by_target(target_name)
print(len(contextlist))
17

We can also plot the contexts using the helper function PdmContext.utils.showcontext.show_context_list.

In [10]:
from PdmContext.utils.showcontext import show_context_list
show_context_list(contextlist, target_text=target_name)

We can use a filter to exclude some relationships from the plot (this is quite useful when many data series are involved). Although there is no practical use in our example, we will now exclude the config->anomaly1 relationships and keep only anomaly1->config, which refer to the same samples (this is done here only for presentation purposes).

In [11]:
show_context_list(contextlist, target_text=target_name, filteredges=[["config", "anomaly1", ""]])

Distance Function ¶

To compare two Context objects we need a similarity (or distance) measure.

The user can implement their own distance function, which accepts two parameters (two Context objects).

Below is an example which uses distance_cc (the SBD distance between the CD parts and the Jaccard similarity between the CR parts of the two contexts, weighted by factors a and b = 1 - a).

In this example we build our own distance function my_distance(c1: Context, c2: Context) by using specific values of a and b.

In [12]:
from PdmContext.utils.distances import distance_cc

def my_distance(c1:Context,c2:Context):

    return distance_cc(c1,c2,a=0)

# JACCARD similarity between the CR components
print(my_distance(contextlist[0],contextlist[-1]))
print(my_distance(contextlist[-2],contextlist[-1]))
0
1.0
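
To see where these values come from: with a=0 the score reduces to the Jaccard similarity of the two CR edge sets, so disjoint relationship sets give 0 and identical ones give 1.0. A rough illustration of that idea (treating each CR as a plain set of directed edges is an assumption made for this sketch):

# Sketch of Jaccard similarity over two edge sets (not the library's implementation).
def jaccard(edges1, edges2):
    s1, s2 = set(edges1), set(edges2)
    if not s1 and not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

print(jaccard([("config", "anomaly1")], []))                        # 0.0 -> no shared edges
print(jaccard([("config", "anomaly1")], [("config", "anomaly1")]))  # 1.0 -> identical edge sets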

Clustering Context ¶

Using PdmContext.ContextClustering.DBscanContextStream we can cluster Context objects.

Clustering over Context objects has two main requirements:

  1. a streaming interface (when we want to cluster the Context objects as they arrive from the Context Generator)
  2. an appropriate distance measure

Regarding 1), a simplified DBSCAN algorithm for streaming data has been implemented in PdmContext.ContextClustering.DBscanContextStream (we can feed it iteratively using its add_sample_to_cluster method).

For 2), sample distance functions exist in PdmContext.utils.distances, and the user can define their own as shown previously.

Creating a PdmContext.ContextClustering.DBscanContextStream object:

In [13]:
from PdmContext.ContextClustering import DBscanContextStream


# use the distance function from before
clustering=DBscanContextStream(cluster_similarity_limit=0.7,distancefunc=my_distance)

for context_object in contextlist:
    clustering.add_sample_to_cluster(context_object)

print(clustering.clusters_sets)
clustering.plot()
[[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]

The combination of clustering and the Context Generator can also be used through the pipeline PdmContext.Pipelines.ContextAndClustering.

In [14]:
from PdmContext.Pipelines import ContextAndClustering

con_gen_2 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
clustering_2=DBscanContextStream(cluster_similarity_limit=0.7,min_points=2,distancefunc=my_distance)

contextpipeline2 = ContextAndClustering(context_generator_object=con_gen_2,Clustring_object=clustering_2)


source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline2.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline2.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
contextpipeline2.clustering.plot()
contextpipeline2.Contexter.plot()

Pipelines ¶

There are three pipelines that wrap the Database Connector, Context Generator, and Clustering (all exposing the same collect_data API as the Context Generator):

  1. PdmContext.Pipelines.ContextAndClustering (wraps the Context Generator and feeds its results to clustering)
  2. PdmContext.Pipelines.ContextAndDatabase (wraps the Context Generator and feeds its results to the database)
  3. PdmContext.Pipelines.ContextAndClusteringAndDatabase (wraps the Context Generator and feeds its results to both the database and clustering); a sketch of this one follows below
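
A minimal sketch of the third pipeline, assuming it accepts the same keyword arguments as the two pipelines demonstrated above (this signature is an assumption; check the documentation for the exact parameters):

from PdmContext.Pipelines import ContextAndClusteringAndDatabase

# Assumed keyword arguments, combining those of ContextAndDatabase and ContextAndClustering.
full_pipeline = ContextAndClusteringAndDatabase(context_generator_object=con_gen_2,
                                                Clustring_object=clustering_2,
                                                databaseStore_object=database)
# It can then be fed with the same collect_data() calls as the other pipelines.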

Simulators ¶

Because the Context Generator works in a streaming fashion, simulators are provided as helpers for the user (PdmContext.utils.simulate_stream).

Simulator stream ¶

Example (PdmContext.utils.simulate_stream.simulate_stream):

This simulator takes three arguments:

  1. A list of time series data: tuples of the form (name: str, values: list, timestamps: list)
  2. A list of event series data: tuples of the form (name: str, occurrences: list of dates, type: str)
  3. The target name
In [15]:
from PdmContext.utils.simulate_stream import simulate_stream


start2 = pd.to_datetime("2023-01-01 00:00:00")
timestamps2 = [start2 + pd.Timedelta(minutes=i) for i in range(17)]

eventconf2=("config",[pd.to_datetime("2023-01-01 00:09:00")],"configuration")
spiketuples2=("spikes",[pd.to_datetime("2023-01-01 00:01:00"),pd.to_datetime("2023-01-01 00:13:00")],"isolated")

data1tuples2=("data1",[random() for i in range(17)],timestamps2)
anomaly1tuples2=("anomaly1", [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7],timestamps2)

stream=simulate_stream([data1tuples2,anomaly1tuples2],[eventconf2,spiketuples2],"anomaly1")

contextpipeline3 =  ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)
source="press"


for record in stream:
    #print(record)
    contextpipeline3.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"],value=record["value"])
contextpipeline3.plot()

Simulator using pandas DataFrame ¶

This simulator can be used when all the data already exist in a pandas DataFrame.

Example (PdmContext.utils.simulate_stream.simulate_from_df):

This simulator takes three arguments:

  1. A DataFrame
  2. Which columns represent events and of what type, e.g. [("column1","isolated"),("column3","configuration")]
  3. The target name (an existing DataFrame column)
In [16]:
from PdmContext.utils.simulate_stream import simulate_from_df

df = pd.read_csv("dummy_data.csv",index_col=0)
df.index=pd.to_datetime(df.index)
print(df.head())
target_name="anomaly1"
stream = simulate_from_df(df,[("configur","configuration"),("spikes","isolated")], target_name)

contextpipeline4 =  ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)

source = "press"
for record in stream:
    contextpipeline4.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"],value=record["value"])
contextpipeline4.plot()
                        data1  configur  anomaly1  spikes
2023-01-01 00:00:00  0.263462         0       0.2       0
2023-01-01 00:01:00  0.827615         0       0.3       1
2023-01-01 00:02:00  0.381941         0       0.2       0
2023-01-01 00:03:00  0.327193         0       0.1       0
2023-01-01 00:04:00  0.837698         0       0.2       0

Interpretation ¶

Based on the edges in a Context's CR (generated by causal discovery), we can try to interpret the behavior of the target series.

For example, consider the case below with two configuration events and one isolated event, along with an anomaly score.

In [17]:
from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.causal_discovery_functions import calculate_with_pc
from PdmContext.utils.simulate_stream import simulate_from_df
from random import random
import matplotlib.pyplot as plt
import pandas as pd

size=100
isoEv1=[0 for i in range(size)]
confevent1=[0 for i in range(size)]
confevent2=[0 for i in range(size)]
noise=random()/10
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(hours=i) for i in range(size)]
confevent1[31]=1
confevent2[33]=1
isoEv1[69]=1
score=[1+random()/10 for i in range(30)]+ [1+(i/5)+random()/10 for i in range(5)] +[2+random()/10 for i in range(65)]

score[70]+=1
contextgenerator=ContextGenerator("score",context_horizon="100",Causalityfunct=calculate_with_pc)


dfdata={
    "score":score,
    "confEv1":confevent1,
    "confEv2":confevent2,
    "isoEv1":isoEv1,
}
df=pd.DataFrame(dfdata,index=timestamps)
df.plot()
plt.show()

stream = simulate_from_df(df,eventTypes=[("isoEv1","isolated"),("confEv1","configuration"),("confEv2","configuration")],target_name="score")
source="press"
for record in stream:
    contextgenerator.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"],value=record["value"])
contextgenerator.plot_interpretation()
listcontexts=contextgenerator.contexts

In the two plots we can observe the raw data (upper) and the interpretation (lower). Regarding the interpretation plot in the lower part, due to space limitations only a part of the interpretations is shown.

Looking closer at the time when the increase starts and when the spike occurs, we can get a better understanding of the interpretation.

Below we plot the CD and CR parts of the context. For the CD part the data series are shown, and for the CR part the graph structure from causal discovery (left) is depicted along with the interpretation (right).

In [18]:
listcontexts[35].plot()
1) confEv1@press : 2023-01-02 07:00:00
2) confEv2@press : 2023-01-02 09:00:00

We observe two interpretations, confEv1 and confEv2, where confEv1 started to cause the target series (score) before confEv2 (based on the timestamps).

Using these timestamps we can reason better about what may have caused the increase in the score.

Similarly in the next example, where the spike occurred: clearly the spike is due to the isoEv1 occurrence, but the general rise in the score is still caused by confEv1 and confEv2. So the interpretation again contains all three, with timestamps indicating when each cause started in time.

In [19]:
listcontexts[70].plot()
1) confEv1@press : 2023-01-02 07:00:00
2) confEv2@press : 2023-01-02 09:00:00
3) isoEv1@press : 2023-01-03 22:00:00

Limitations ¶

  1. Distance computations are not scalable
  2. Causality discovery is generally slow
  3. Context extraction may not be optimized (in terms of time complexity)