The documentation can be found here.
from PdmContext.ContextGeneration import ContextGenerator
from PdmContext.utils.structure import Context
Context is used here to provide a better understanding of the different situations the data reflect at each point in time.
In essence, a Context represents the data (CD) existing in a time window, along with their relationships (CR), where the relationships are extracted using causal discovery between the data (the causal discovery method can be user defined).
PdmContext.utils.structure.Context is used to represent such a context.
CD contains data from different sources, and supports different sampling rates of signals as well as discrete event data. Differences in sampling rate are handled internally in the context generation process, where all series are mapped to the sampling rate of a single series, called the target series (also referred to as such in the code and documentation).
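As a rough illustration of what this alignment looks like (a minimal sketch, not the library's internal code; all names here are hypothetical), a slower series can be forward-filled onto the timestamps of the target series:

import pandas as pd

# Target series sampled every minute; a slower series sampled every 3 minutes
target_index = pd.date_range("2023-01-01 00:00:00", periods=6, freq="1min")
slow = pd.Series([10.0, 12.0],
                 index=pd.to_datetime(["2023-01-01 00:00:00", "2023-01-01 00:03:00"]))
# Map the slow series onto the target's sampling rate by forward filling
aligned = slow.reindex(target_index, method="ffill")
print(aligned)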
The Context also supports data which are not numeric but are related to some kind of event (events that occur in time); these are often referred to as discrete data. To this end, the Context supports two types of such events:
- isolated: events related to an instantaneous episode (such as a spike)
- configuration: events whose effect persists after their occurrence (such as a configuration change)
The type of an event determines how it is transformed into continuous space and added to CD.
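As a simplified sketch of this idea (an illustration of the concept, not the library's exact transformation), an isolated event can be mapped to an impulse that is non-zero only at its occurrence, while a configuration event can be mapped to a step that stays raised after it occurs:

import pandas as pd

ts = [pd.Timestamp("2023-01-01 00:00:00") + pd.Timedelta(minutes=i) for i in range(5)]
event_time = ts[2]
# isolated -> impulse: 1 at the occurrence, 0 everywhere else
isolated_series = [1.0 if t == event_time else 0.0 for t in ts]
# configuration -> step: 0 before the occurrence, 1 from then on
configuration_series = [1.0 if t >= event_time else 0.0 for t in ts]
print(isolated_series)       # [0.0, 0.0, 1.0, 0.0, 0.0]
print(configuration_series)  # [0.0, 0.0, 1.0, 1.0, 1.0]

Returning to the running example, we first create some artificial data: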
import pandas as pd
from random import random
# Create artificial timestamps
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(17)]
# Create a real-valued time series
data1 = [random() for i in range(17)]
# Create a series of anomaly scores
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]
# Create a configuration event series
# (for example, in the series below there was a configuration event at the 9th timestamp)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
# Create an isolated-event time series (with occurrences at indices 1 and 13)
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1
Provide the name of the target series, the time window length via the context_horizon parameter, and a causality function to calculate CR (we leave this for later):
from PdmContext.utils.causal_discovery_functions import calculatewithPc
con_gen = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
Iteratively, we pass the data to the Context Generator by calling the collect_data() method. Each call passes a single data sample or event from a single source. This method returns a Context object whenever we pass data whose name matches the specified target.
source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    con_gen.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        con_gen.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = con_gen.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
We can plot the last Context object that was returned (this object refers to the last timestamp). The first plot shows the CD part of the Context, and the second shows the CR (in the form of a graph).
contextTemp.plot()
[('anomaly1', 'config@press'), ('config@press', 'anomaly1')]
Moreover, we can plot all the contexts of the ContextGenerator (at a more abstract level) using its plot method:
con_gen.plot()
The values on this plot are the target values used to build each context sample, and the colors correspond to the relationships present in the CR of each Context object. In this example we can see that the anomaly1 series increases due to the config event (as indicated by the CR).
The user can implement their own causal discovery method and provide it to the Context Generator.
To do this, one simply needs to implement a Python function that takes as parameters the list of series names and a 2-D array of their values (one column per series), and returns the discovered edges as pairs of names (or None on failure).
Example: PdmContext.utils.causal_discovery_functions.calculatewithPc
import networkx as nx
# pip install gcastle
from castle.algorithms import PC

def calculatewithPc(names, data):
    try:
        pc = PC(variant='parallel')
        pc.learn(data)
    except Exception as e:
        print(e)
        return None
    # Build a directed graph from the learned causal adjacency matrix
    learned_graph = nx.DiGraph(pc.causal_matrix)
    # Relabel the integer nodes with the series names
    MAPPING = {k: n for k, n in zip(range(len(names)), names)}
    learned_graph = nx.relabel_nodes(learned_graph, MAPPING, copy=True)
    edges = learned_graph.edges
    return edges
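Any function with this (names, data) signature can be plugged in as Causalityfunct. As a hedged illustration (not part of the library, and not a true causal method, since correlation is symmetric), here is a naive correlation-based alternative that connects series whose absolute Pearson correlation exceeds a threshold:

import numpy as np

def calculate_with_correlation(names, data, threshold=0.7):
    # data: samples x variables, one column per series (same layout as in calculatewithPc)
    corr = np.corrcoef(np.asarray(data), rowvar=False)
    edges = []
    for i in range(len(names)):
        for j in range(len(names)):
            if i != j and abs(corr[i, j]) > threshold:
                # correlation is symmetric, so both directions are emitted
                edges.append((names[i], names[j]))
    return edges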
The current implementation supports connections to two databases (SQLite3 and InfluxDB) through PdmContext.utils.dbconnector.SQLiteHandler and PdmContext.utils.dbconnector.InfluxDBHandler.
Using SQLite will create a database file in the location of the main script.
Using InfluxDB requires the InfluxDB service to be started beforehand (for example, on Linux: sudo service influxdb start).
Both database connections can be used with the implemented pipeline PdmContext.Pipelines.ContextAndDatabase.
Let's generate the same example as before, but this time store it in the database:
from PdmContext.utils.dbconnector import SQLiteHandler
from PdmContext.Pipelines import ContextAndDatabase
con_gen = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
database = SQLiteHandler(db_name="ContextDatabase.db")
contextpipeline = ContextAndDatabase(context_generator_object=con_gen, databaseStore_object=database)
configur = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
data1 = [random() for i in range(len(configur))]
start = pd.to_datetime("2023-01-01 00:00:00")
timestamps = [start + pd.Timedelta(minutes=i) for i in range(len(data1))]
anomaly1 = [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7]
spikes = [0 for i in range(len(data1))]
spikes[1] = 1
spikes[13] = 1
source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)
contextpipeline.Contexter.plot()
Now we can access the Context objects stored in the database:
database = SQLiteHandler(db_name="ContextDatabase.db")
target_name = "anomaly1"
contextlist = database.get_all_context_by_target(target_name)
print(len(contextlist))
17
We can also plot the contexts using the helper function PdmContext.utils.showcontext.show_Context_list:
from PdmContext.utils.showcontext import show_Context_list
show_Context_list(contextlist, target_text=target_name)
We can use a filter to exclude some relationships from the plot (this is quite useful when many data series are involved). Although there is no practical use for it in our example, we will now exclude the config->anomaly1 relationships and keep only anomaly1->config, which occur in the same samples (this is done here purely for presentation purposes):
show_Context_list(contextlist, target_text=target_name, filteredges=[["config", "anomaly1", ""]])
To compare two Context objects we need a similarity (or distance) measure.
The user can implement their own distance function, which accepts exactly two parameters (two Context objects).
Below is an example which uses distance_cc (the SBD distance between the CD parts and the Jaccard similarity between the CR parts of the two contexts, weighted by factors a and b=1-a).
In this example we build our own distance function my_distance(c1: Context, c2: Context) using specific values of a and b:
from PdmContext.utils.distances import distance_cc

def my_distance(c1: Context, c2: Context):
    # Use only the Jaccard similarity between the CR components (a=0, b=1)
    return distance_cc(c1, c2, a=0, b=1)

print(my_distance(contextlist[0], contextlist[-1]))
print(my_distance(contextlist[-2], contextlist[-1]))
0
1.0
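In the same way one could weight both parts, for example equally (my_blended_distance is just an illustrative name):

def my_blended_distance(c1: Context, c2: Context):
    # SBD over the CD parts and Jaccard over the CR parts, weighted equally (a=0.5, b=1-a)
    return distance_cc(c1, c2, a=0.5, b=0.5)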
Using PdmContext.ContextClustering.DBscanContextStream we can cluster Context objects.
Clustering over Context objects comes with two main challenges:
1. The clustering has to run in a streaming fashion, since Context objects arrive one at a time.
2. A distance (or similarity) measure between Context objects has to be defined.
Regarding 1), an example of a simplified DBSCAN algorithm for streaming data is provided in PdmContext.ContextClustering.DBscanContextStream (we can feed it iteratively using its add_sample_to_cluster method).
For 2), sample distance functions exist in PdmContext.utils.distances, and the user can define their own as shown previously.
Create a PdmContext.ContextClustering.DBscanContextStream object:
from PdmContext.ContextClustering import DBscanContextStream

# use the distance function from before
clustering = DBscanContextStream(cluster_similarity_limit=0.7, distancefunc=my_distance)

for context_object in contextlist:
    clustering.add_sample_to_cluster(context_object)

print(clustering.clusters_sets)
clustering.plot()
[[0, 1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11, 12, 13, 14, 15, 16]]
The combination of the clustering and the ContextGenerator can also be run through the pipeline PdmContext.Pipelines.ContextAndClustering:
from PdmContext.Pipelines import ContextAndClustering

con_gen_2 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc, debug=False)
clustering_2 = DBscanContextStream(cluster_similarity_limit=0.7, min_points=2, distancefunc=my_distance)
contextpipeline2 = ContextAndClustering(context_generator_object=con_gen_2, Clustring_object=clustering_2)

source = "press"
for d1, an1, t1, sp1, con1 in zip(data1, anomaly1, timestamps, spikes, configur):
    contextpipeline2.collect_data(timestamp=t1, source=source, name="data1", value=d1)
    if sp1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="spike", type="isolated")
    if con1 == 1:
        contextpipeline2.collect_data(timestamp=t1, source=source, name="config", type="configuration")
    contextTemp = contextpipeline2.collect_data(timestamp=t1, source=source, name="anomaly1", value=an1)

contextpipeline2.clustering.plot()
contextpipeline2.Contexter.plot()
There are three pipelines which wrap the database connector, the Context Generator, and the clustering (all exposing the same collect_data API as the Context Generator).
Because the Context Generator works in a streaming fashion, simulators are provided as helpers for the user (PdmContext.utils.simulate_stream).
Example (PdmContext.utils.simulate_stream.simulate_stream):
This simulator needs two lists: one with tuples of continuous series (name, values, timestamps) and one with tuples of event data (name, occurrence timestamps, event type), along with the target name:
from PdmContext.utils.simulate_stream import simulate_stream

start2 = pd.to_datetime("2023-01-01 00:00:00")
timestamps2 = [start2 + pd.Timedelta(minutes=i) for i in range(17)]

# Event tuples: (name, occurrence timestamps, event type)
eventconf2 = ("config", [pd.to_datetime("2023-01-01 00:09:00")], "configuration")
spiketuples2 = ("spikes", [pd.to_datetime("2023-01-01 00:01:00"), pd.to_datetime("2023-01-01 00:13:00")], "isolated")

# Continuous-series tuples: (name, values, timestamps)
data1tuples2 = ("data1", [random() for i in range(17)], timestamps2)
anomaly1tuples2 = ("anomaly1", [0.2, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.4, 0.8, 0.7, 0.7, 0.8, 0.7, 0.8, 1, 0.6, 0.7], timestamps2)

stream = simulate_stream([data1tuples2, anomaly1tuples2], [eventconf2, spiketuples2], "anomaly1")
contextpipeline3 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)
source = "press"
for record in stream:
    # print(record)
    contextpipeline3.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])
contextpipeline3.plot()
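Judging from the fields accessed in the loop above, each record yielded by the stream is a dictionary of roughly the following shape (the values here are illustrative, and the None conventions are assumptions):

# Illustrative only, not produced by the code above
record = {
    "timestamp": pd.to_datetime("2023-01-01 00:01:00"),
    "name": "spikes",     # series or event name
    "type": "isolated",   # event type (assumed None for plain numeric samples)
    "value": None,        # numeric value (assumed None for events)
}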
In case a DataFrame with all the data already exists:
Example (PdmContext.utils.simulate_stream.simulate_from_df):
This simulator needs the DataFrame, a list of (column name, event type) tuples describing the event columns, and the target name:
from PdmContext.utils.simulate_stream import simulate_from_df
df = pd.read_csv("dummy_data.csv",index_col=0)
df.index=pd.to_datetime(df.index)
print(df.head())
target_name = "anomaly1"
stream = simulate_from_df(df, [("configur", "configuration"), ("spikes", "isolated")], target_name)
contextpipeline4 = ContextGenerator(target="anomaly1", context_horizon="8 hours", Causalityfunct=calculatewithPc)
source = "press"
for record in stream:
    contextpipeline4.collect_data(timestamp=record["timestamp"], source=source, name=record["name"], type=record["type"], value=record["value"])
contextpipeline4.plot()
                        data1  configur  anomaly1  spikes
2023-01-01 00:00:00  0.263462         0       0.2       0
2023-01-01 00:01:00  0.827615         0       0.3       1
2023-01-01 00:02:00  0.381941         0       0.2       0
2023-01-01 00:03:00  0.327193         0       0.1       0
2023-01-01 00:04:00  0.837698         0       0.2       0