Sample code, in a Python notebook, showing how to use mat-data as a Python library.
The package offers a tool to support the user in preprocessing multiple aspect trajectory data and in generating synthetic datasets. It is part of a single framework for mining multiple aspect trajectories and, more generally, multidimensional sequence data.
Created on Dec, 2023 Copyright (C) 2023, License GPL Version 3 or superior (see LICENSE file)
import sys, os
root = os.path.join('automatize', 'assets', 'examples', 'Example')
# We assume the following folder organization for the experimental environment:
prg_path = os.path.join(root, 'programs')
data_path = os.path.join(root, 'data')
res_path = os.path.join(root, 'results')
# OR, you can use the .jar method files in:
prg_path = os.path.join('automatize', 'assets', 'method')
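If the folders above do not exist yet, they can be created before running any method. The following is a minimal sketch using only the standard library. `make_experiment_dirs` is a hypothetical helper name, not part of mat-data, and a temporary directory stands in for the real project root.

```python
import os
import tempfile

def make_experiment_dirs(root):
    """Create the programs/data/results folders under `root` (hypothetical helper)."""
    paths = {name: os.path.join(root, name) for name in ('programs', 'data', 'results')}
    for path in paths.values():
        os.makedirs(path, exist_ok=True)  # no error if the folder already exists
    return paths

# A temporary directory is used here so the sketch does not touch real files.
base = tempfile.mkdtemp()
example_root = os.path.join(base, 'automatize', 'assets', 'examples', 'Example')
paths = make_experiment_dirs(example_root)
print(sorted(os.listdir(example_root)))
```

Calling the helper a second time on the same root is harmless, because `exist_ok=True` skips folders that are already there.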
To use the data preprocessing helpers, import them from the matdata.preprocess package:
from matdata.preprocess import *
The preprocessing module provides functions to work with the data:
Basic functions:
- readDataset: loads datasets as a pandas DataFrame (from .csv, .zip, or .ts)
- printFeaturesJSON: prints a default JSON file descriptor for Movelets methods (version 1 or 2)
- datasetStatistics: calculates statistics from a dataset DataFrame

Train and test split functions:
- trainAndTestSplit: splits a dataset (pandas DataFrame) into train / test (70/30% by default)
- kfold_trainAndTestSplit: splits a dataset (pandas DataFrame) into k-fold train / test (80/20% each fold by default)
- stratify: extracts trajectories from the dataset, creating a subset of the data (to use when smaller datasets are needed)
- joinTrainAndTest: joins the train and test files into one DataFrame

Type conversion functions:
- convertDataset: default format conversions. Reads the dataset files and saves them in .csv and .zip formats; also does the k-fold split if not present
- zip2df: converts .zip files and saves to a DataFrame
- zip2csv: converts .zip files and saves to .csv files
- df2zip: converts a DataFrame and saves to .zip files
- zip2arf: converts .zip files and saves to .arf files
- any2ts: converts .zip or .csv files and saves to .ts files
- xes2csv: reads .xes files and converts to a DataFrame

#cols = ['tid','label','lat','lon','day','hour','poi','category','price','rating']
df = joinTrainAndTest(data_path, train_file="train.csv", test_file="test.csv", class_col = 'label')
df.head()
Joining train and test data from... automatize/assets/examples/Example/data
Done.
--------------------------------------------------------------------------------
| | tid | lat_lon | hour | price | poi | weather | day | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 12 | 0.0 6.2 | 8 | -1 | Home | Clear | Monday | Classs_False |
| 1 | 12 | 0.8 6.2 | 9 | 2 | University | Clouds | Monday | Classs_False |
| 2 | 12 | 3.1 11 | 12 | 2 | Restaurant | Clear | Monday | Classs_False |
| 3 | 12 | 0.8 6.5 | 13 | 2 | University | Clear | Monday | Classs_False |
| 4 | 12 | 0.2 6.2 | 17 | -1 | Home | Rain | Monday | Classs_False |
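In this layout each row is one trajectory point, and rows that share a `tid` belong to the same trajectory. As a minimal sketch of that structure, using plain Python tuples rather than the pandas DataFrame returned above, the points can be grouped by `tid`. The rows are copied from the first three rows of the output; `group_by_tid` is a hypothetical helper, not a mat-data function.

```python
from collections import defaultdict

# First three rows of the output above, as
# (tid, lat_lon, hour, price, poi, weather, day, label).
rows = [
    (12, '0.0 6.2', 8, -1, 'Home', 'Clear', 'Monday', 'Classs_False'),
    (12, '0.8 6.2', 9, 2, 'University', 'Clouds', 'Monday', 'Classs_False'),
    (12, '3.1 11', 12, 2, 'Restaurant', 'Clear', 'Monday', 'Classs_False'),
]

def group_by_tid(rows):
    """Group point rows into trajectories keyed by tid (hypothetical helper)."""
    trajectories = defaultdict(list)
    for row in rows:
        trajectories[row[0]].append(row[1:])  # keep the input order of the points
    return dict(trajectories)

trajectories = group_by_tid(rows)
for tid, points in trajectories.items():
    label = points[0][-1]  # in these rows the label is the same on every point
    print(tid, label, [p[3] for p in points])  # p[3] is the poi aspect
```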
To k-fold split a dataset into train and test:
k = 3
train, test = kfold_trainAndTestSplit(data_path, k, df, random_num=1, class_col='label')
3-fold train and test split in... automatize/assets/examples/Example/data
Spliting Data: 0%| | 0/2 [00:00<?, ?it/s]
Done. Writing files ... 1/3
Writing TRAIN - ZIP|1: 0%| | 0/7 [00:00<?, ?it/s]
Writing TEST - ZIP|1: 0%| | 0/4 [00:00<?, ?it/s]
Writing TRAIN / TEST - CSV|1
Writing TRAIN - MAT|1: 0%| | 0/7 [00:00<?, ?it/s]
Writing TEST - MAT|1: 0%| | 0/4 [00:00<?, ?it/s]
Writing files ... 2/3
Writing TRAIN - ZIP|2: 0%| | 0/7 [00:00<?, ?it/s]
Writing TEST - ZIP|2: 0%| | 0/4 [00:00<?, ?it/s]
Writing TRAIN / TEST - CSV|2
Writing TRAIN - MAT|2: 0%| | 0/7 [00:00<?, ?it/s]
Writing TEST - MAT|2: 0%| | 0/4 [00:00<?, ?it/s]
Writing files ... 3/3
Writing TRAIN - ZIP|3: 0%| | 0/8 [00:00<?, ?it/s]
Writing TEST - ZIP|3: 0%| | 0/3 [00:00<?, ?it/s]
Writing TRAIN / TEST - CSV|3
Writing TRAIN - MAT|3: 0%| | 0/8 [00:00<?, ?it/s]
Writing TEST - MAT|3: 0%| | 0/3 [00:00<?, ?it/s]
Done.
--------------------------------------------------------------------------------
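A k-fold split of trajectory data is usually made over whole trajectories, so that the points of one trajectory never land in both train and test. As an illustration of that idea only, and not the implementation used by `kfold_trainAndTestSplit`, a class-stratified k-fold split over trajectory ids can be sketched in plain Python. The function name `kfold_by_trajectory`, the toy ids, and the labels are assumptions made for this example.

```python
import random
from collections import defaultdict

def kfold_by_trajectory(labels_by_tid, k, seed=1):
    """Return k (train_tids, test_tids) pairs, stratified by class label.

    labels_by_tid maps each trajectory id to its class label.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tid, label in sorted(labels_by_tid.items()):
        by_class[label].append(tid)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        tids = by_class[label]
        rng.shuffle(tids)
        for i, tid in enumerate(tids):
            folds[i % k].append(tid)  # deal ids round-robin into the folds

    splits = []
    for i in range(k):
        test = sorted(folds[i])  # fold i is the test set exactly once
        train = sorted(t for j, fold in enumerate(folds) if j != i for t in fold)
        splits.append((train, test))
    return splits

# Toy data: 11 trajectory ids with two made-up class labels.
labels_by_tid = {tid: ('A' if tid % 2 else 'B') for tid in range(1, 12)}
splits = kfold_by_trajectory(labels_by_tid, k=3)
for train, test in splits:
    print(len(train), len(test))
```

Dealing the shuffled ids of each class round-robin keeps the class proportions roughly equal across folds, which is the point of stratifying.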
To convert train and test from one available format to other default formats (CSV, ZIP, MAT):
convertDataset(data_path)
Writing TRAIN - ZIP|: 0%| | 0/14 [00:00<?, ?it/s]
Writing TEST - ZIP|: 0%| | 0/14 [00:00<?, ?it/s]
Writing TRAIN - MAT|: 0%| | 0/14 [00:00<?, ?it/s]
Writing TEST - MAT|: 0%| | 0/14 [00:00<?, ?it/s]
All Done.
TODO
from matdata.generator import *
- scalerSamplerGenerator: generates trajectory datasets based on real data on scale intervals
- samplerGenerator: generates a trajectory dataset based on real data
- scalerRandomGenerator: generates trajectory datasets based on random data on scale intervals
- randomGenerator: generates a trajectory dataset based on random data
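The generator section is still marked as TODO in this notebook, so the parameters of these functions are not shown here. To illustrate only the general idea of building a trajectory dataset from random data, the following standard-library sketch draws random values for a few aspects. The function `random_trajectories`, its parameters, and the value pools are all hypothetical and do not reflect the `matdata.generator` API.

```python
import random

def random_trajectories(n_trajectories, n_points, classes, seed=1):
    """Generate rows of (tid, label, hour, poi, weather) from random data."""
    rng = random.Random(seed)
    pois = ['Home', 'University', 'Restaurant']
    weather = ['Clear', 'Clouds', 'Rain']
    rows = []
    for tid in range(1, n_trajectories + 1):
        label = rng.choice(classes)  # one class label per trajectory
        hours = sorted(rng.sample(range(24), n_points))  # distinct, increasing hours
        for hour in hours:
            rows.append((tid, label, hour, rng.choice(pois), rng.choice(weather)))
    return rows

rows = random_trajectories(n_trajectories=4, n_points=5, classes=['C1', 'C2'])
print(len(rows))
print(rows[0])
```

Fixing the seed makes the generated dataset reproducible, which helps when comparing methods on the same synthetic data.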
# By Tarlis Portela (2023)