from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.structure import Context
Context is used here to capture the different circumstances under which the data are observed at each point in time.
Essentially, a Context represents the data (CD) existing in a time window and their relationships (CR), where the relationships are extracted by running causal discovery over the data (the causal discovery method can be user-defined).
PdmContext.utils.structure.Context is used to represent such a context.
To this end, CD contains data from different sources and supports different signal sample rates as well as discrete event data. Differences in sample rate are handled internally in the context generation process, where all series are mapped to the sample rate of a single series, called the target series (also referred to as such in the code and documentation).
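The exact alignment policy is internal to the library; as a rough illustration of the idea only (plain pandas, not PdmContext code), a faster series can be mapped onto the target's timestamps like this:
import pandas as pd
# Hypothetical illustration: align a 1-minute series to a 2-minute target series
target = pd.Series([0.2, 0.3, 0.2], index=pd.date_range("2023-01-01", periods=3, freq="2min"))
fast = pd.Series(range(6), index=pd.date_range("2023-01-01", periods=6, freq="1min"))
# Map the faster series to the target's sample rate (last known value)
aligned = fast.reindex(target.index, method="ffill")
print(aligned)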
Context also supports data which are not numeric but relate to some kind of event occurring in time; these are often referred to as discrete data. To this end, the Context supports two types of such events: isolated events, whose impact is momentary (like the spikes below), and configuration events, whose effect persists after they occur (like the configuration change below).
The type of an event determines how it is transformed into continuous space and added to CD.
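As a rough illustration of that transformation (the actual encoding is internal to the library and may differ), an isolated event could be encoded as an impulse at its timestamp, while a configuration event could be encoded as a step that persists from its occurrence onward:
import pandas as pd
idx = pd.date_range("2023-01-01", periods=6, freq="1min")
def encode_isolated(event_times, index):
    # Impulse: 1 at the event timestamps, 0 elsewhere
    return pd.Series([1 if t in event_times else 0 for t in index], index=index)
def encode_configuration(event_times, index):
    # Step: 0 before the first event, 1 from each event onward
    return pd.Series([1 if any(t >= e for e in event_times) else 0 for t in index], index=index)
print(encode_isolated([idx[2]], idx).tolist())       # [0, 0, 1, 0, 0, 0]
print(encode_configuration([idx[2]], idx).tolist())  # [0, 0, 1, 1, 1, 1]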
import pandas as pd
from random import random
# Create artificial timestamps
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(17)]
# Create a real-valued time series
data1 = [random() for i in range(17)]
# Create a series of anomaly scores
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]
# Create a configuration Event series
# (for example, in the series below there was a configuration event at the 9th timestamp)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Create an isolated Event time series (with occurrences at the 1st and 13th timestamps)
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1
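Optionally, we can take a quick look at these inputs with plain matplotlib (not part of the library):
import matplotlib.pyplot as plt
# Visualize the four generated series over the shared timestamps
plt.plot(timestamps, data1, label="data1")
plt.plot(timestamps, anomaly1, label="anomaly1")
plt.plot(timestamps, configur, label="config")
plt.plot(timestamps, spikes, label="spike")
plt.legend()
plt.show()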
Provide the name of the target series, the time window length via the context_horizon parameter, and the causality function used to calculate CR (custom causality functions are covered later).
from PdmContext.utils.causal_discovery_functions import calculate_with_pc
con_gen = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculate_with_pc, debug=False)
We pass the data to the Context Generator iteratively by calling its collect_data() method, each time passing a single data sample or event from a single source. This method returns a Context object whenever the passed data carries the name of the specified target.
source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    con_gen.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = con_gen.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
We can plot the last Context object that was returned (the one corresponding to the last timestamp). The first plot shows the CD part of the Context, and the second plot shows the CR part (in the form of a graph).
contextTemp.plot()
1) config@press : 2023-01-01 00:08:00
Moreover, we can plot all the contexts of the ContextGenerator (at a more abstract level) using its plot method.
con_gen.plot()
The values on this plot are the target values used to build each context sample, and the colors correspond to the relationships that exist in the CR of each Context object. In this example, we can see that the anomaly1 series increases due to the config event (as seen from the CR).
The user can implement and provide their own causal discovery method to the Context Generator.
To do this, simply implement a Python function that takes two parameters, a list of the series names and the corresponding data, and returns the edges of the discovered causal graph (or None on failure).
Example: PdmContext.utils.causal_discovery_functions.calculate_with_pc (reimplemented below as calculatewithPc):
import networkx as nx
# pip install gcastle
from castle.algorithms import PC

def calculatewithPc(names, data):
    try:
        pc = PC(variant='parallel')
        pc.learn(data)
    except Exception as e:
        print(e)
        return None
    # Build a directed graph from the learned causal matrix
    learned_graph = nx.DiGraph(pc.causal_matrix)
    # Relabel the nodes from matrix indices to the series names
    MAPPING = {k: n for k, n in zip(range(len(names)), names)}
    learned_graph = nx.relabel_nodes(learned_graph, MAPPING, copy=True)
    edges = learned_graph.edges
    return edges
The current implementation supports connections to two databases (SQLite3 and InfluxDB) via PdmContext.utils.dbconnector.SQLiteHandler and PdmContext.utils.dbconnector.InfluxDBHandler.
Using SQLite will create a database file in the location of the main script.
Using InfluxDB requires the InfluxDB service to be started beforehand (for example, on Linux: sudo service influxdb start).
Both database connections can be used with the implemented pipeline PdmContext.Pipelines.ContextAndDatabase.
Let's generate the same example as before, but this time store it in the database.
from PdmContext.utils.dbconnector import SQLiteHandler
from PdmContext.Pipelines import ContextAndDatabase
con_gen = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
database = SQLiteHandler(db_name="ContextDatabase.db")
contextpipeline = ContextAndDatabase(context_generator_object=con_gen, databaseStore_object=database)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
data1 = [random() for i in range(len(configur))]
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(len(data1))]
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1
source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
contextpipeline.Contexter.plot()
Now we can access the Context objects stored in the database:
database = SQLiteHandler(db_name="ContextDatabase.db")
target_name = "anomaly1"
contextlist = database.get_all_context_by_target(target_name)
print(len(contextlist))
17
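Each element of contextlist is a full Context object, so we can, for example, plot the most recent one directly:
# Inspect the last stored context (same plot as shown earlier)
contextlist[-1].plot()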
We can also plot the contexts using the helper function PdmContext.utils.showcontext.show_context_list:
from PdmContext.utils.showcontext import show_context_list
show_context_list(contextlist, target_text=target_name)
We can use filters to exclude some relationships from the plot (quite useful when many data series are involved). Although there is no practical use for it in our example, we will now exclude the config->anomaly1 relationships and keep only anomaly1->config, which occur in the same samples (this is done purely for presentation purposes).
show_context_list(contextlist, target_text=target_name, filteredges=[["config", "anomaly1", ""]])
To compare two Context objects we need a similarity (or distance) measure.
The user can implement their own distance function, which accepts exactly two parameters (two Context objects).
Below is an example using distance_cc, which combines the SBD distance between the CD parts and the Jaccard similarity between the CR parts of the two contexts, weighted by factors a and b = 1 - a.
In this example we build our own distance function my_distance(c1: Context, c2: Context) by fixing specific values of a and b.
from PdmContext.utils.distances import distance_cc
def my_distance(c1: Context, c2: Context):
    # With a=0, the distance reduces to the Jaccard similarity between the CR parts
    return distance_cc(c1, c2, a=0)

print(my_distance(contextlist[0], contextlist[-1]))
print(my_distance(contextlist[-2], contextlist[-1]))
0
1.0
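Similarly, a blended distance that weighs both parts can be obtained by picking a value of a between 0 and 1 (here a=0.5, so b = 1 - a = 0.5 gives equal weight to the SBD part over CD and the Jaccard part over CR):
def my_blended_distance(c1: Context, c2: Context):
    # a=0.5 weighs the CD (SBD) part; b=1-a weighs the CR (Jaccard) part
    return distance_cc(c1, c2, a=0.5)

print(my_blended_distance(contextlist[0], contextlist[-1]))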
Using PdmContext.ContextClustering.DBscanContextStream we can cluster Context objects.
Clustering over Context objects poses two main challenges:
1) the clustering must operate on streaming data, and
2) it requires a distance (or similarity) measure between Context objects.
Regarding 1), a simplified DBSCAN algorithm for streaming data has been developed in PdmContext.ContextClustering.DBscanContextStream (we can feed the clusterer iteratively using its add_sample_to_cluster method).
For 2), sample distance functions exist in PdmContext.utils.distances, and the user can define their own, as shown previously.
Creating a PdmContext.ContextClustering.DBscanContextStream object:
from PdmContext.ContextClustering import DBscanContextStream

# use the distance function from before
clustering = DBscanContextStream(cluster_similarity_limit=0.7, distancefunc=my_distance)

for context_object in contextlist:
    clustering.add_sample_to_cluster(context_object)

print(clustering.clusters_sets)
clustering.plot()
[[0, 1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
The combination of clustering and the ContextGenerator can also be implemented using the pipeline PdmContext.Pipelines.ContextAndClustering:
from PdmContext.Pipelines import ContextAndClustering

con_gen_2 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
clustering_2 = DBscanContextStream(cluster_similarity_limit=0.7, min_points=2, distancefunc=my_distance)
contextpipeline2 = ContextAndClustering(context_generator_object=con_gen_2, Clustring_object=clustering_2)

source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline2.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline2.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)

contextpipeline2.clustering.plot()
contextpipeline2.Contexter.plot()
There are three pipelines that wrap the database connector, Context Generator, and clustering (all exposing the same collect_data API as the Context Generator).
Because the Context Generator works in a streaming fashion, simulators are provided as helpers (PdmContext.utils.simulate_stream).
Example (PdmContext.utils.simulate_stream.simulate_stream):
This simulator takes two lists, one of tuples describing continuous series (name, values, timestamps) and one of tuples describing events (name, occurrence timestamps, event type), along with the target name.
from PdmContext.utils.simulate_stream import simulate_stream

start2 = pd.to_datetime("2023-01-01 00:00:00")
timestamps2 = [start2 + pd.Timedelta(minutes=i) for i in range(17)]

eventconf2 = ("config", [pd.to_datetime("2023-01-01 00:09:00")], "configuration")
spiketuples2 = ("spikes", [pd.to_datetime("2023-01-01 00:01:00"), pd.to_datetime("2023-01-01 00:13:00")], "isolated")
data1tuples2 = ("data1", [random() for i in range(17)], timestamps2)
anomaly1tuples2 = ("anomaly1", [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7], timestamps2)

stream = simulate_stream([data1tuples2, anomaly1tuples2], [eventconf2, spiketuples2], "anomaly1")

contextpipeline3 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)

source = "press"
for record in stream:
    # print(record)
    contextpipeline3.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])

contextpipeline3.plot()
In the case of an existing dataframe containing all the data:
Example (PdmContext.utils.simulate_stream.simulate_from_df):
This simulator takes the dataframe, a list of (column name, event type) tuples for the event columns, and the target name.
from PdmContext.utils.simulate_stream import simulate_from_df

df = pd.read_csv("dummy_data.csv", index_col=0)
df.index = pd.to_datetime(df.index)
print(df.head())

target_name = "anomaly1"
stream = simulate_from_df(df, [("configur", "configuration"), ("spikes", "isolated")], target_name)

contextpipeline4 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)

source = "press"
for record in stream:
    contextpipeline4.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])

contextpipeline4.plot()
                        data1  configur  anomaly1  spikes
2023-01-01 00:00:00  0.263462         0       0.2       0
2023-01-01 00:01:00  0.827615         0       0.3       1
2023-01-01 00:02:00  0.381941         0       0.2       0
2023-01-01 00:03:00  0.327193         0       0.1       0
2023-01-01 00:04:00  0.837698         0       0.2       0
Based on the edges in the Context's CR (generated by causal discovery), we can try to interpret the behavior of the target series.
For example, consider the case below with two configuration events and one isolated event, along with an anomaly score.
from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.causal_discovery_functions import calculate_with_pc
from PdmContext.utils.simulate_stream import simulate_from_df
from random import random
import matplotlib.pyplot as plt
import pandas as pd
size=100
isoEv1=[0 for i in range(size)]
confevent1=[0 for i in range(size)]
confevent2=[0 for i in range(size)]
noise=random()/10
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(hours=i) for i in range(size)]
confevent1[31]=1
confevent2[33]=1
isoEv1[69]=1
score = [1 + random()/10 for i in range(30)] + [1 + (i/5) + random()/10 for i in range(5)] + [2 + random()/10 for i in range(65)]
score[70] += 1
contextgenerator = ContextGenerator("score", context_horizon="100", Causalityfunct=calculate_with_pc)
dfdata={
"score":score,
"confEv1":confevent1,
"confEv2":confevent2,
"isoEv1":isoEv1,
}
df=pd.DataFrame(dfdata,index=timestamps)
df.plot()
plt.show()
stream = simulate_from_df(df,eventTypes=[("isoEv1","isolated"),("confEv1","configuration"),("confEv2","configuration")],target_name="score")
source="press"
for record in stream:
    contextgenerator.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])
contextgenerator.plot_interpretation()
listcontexts = contextgenerator.contexts
In the two plots we can observe the raw data (upper) and the interpretation (lower). Regarding the interpretation plot in the lower part, due to space limitations only a part of the interpretations is shown.
Looking closer at the time where the increase starts and at the moment a spike occurs, we can gain a better understanding of the interpretation.
Below we plot the CD and CR parts of the context. For the CD part the data series are shown, and for the CR part the graph structure from causal discovery (left) is depicted along with the interpretation (right).
listcontexts[35].plot()
1) confEv1@press : 2023-01-02 07:00:00
2) confEv2@press : 2023-01-02 09:00:00
We observe two interpretations, confEv1 and confEv2, where confEv1 started causing the target series (score) before confEv2 (based on the timestamps).
Using these timestamps we can reason better about what may cause the increase in the score.
Similarly in the next example, where the spike occurred: clearly the spike is due to the isoEv1 occurrence, but the general rise in the score is still caused by confEv1 and confEv2. So the interpretation again contains all three causes, with timestamps indicating when each cause started.
listcontexts[70].plot()
1) confEv1@press : 2023-01-02 07:00:00
2) confEv2@press : 2023-01-02 09:00:00
3) isoEv1@press : 2023-01-03 22:00:00