datasets subpackage
Overview
The datasets subpackage is a core component designed to manage the complexities of data handling and preparation. It provides a structured and extensible way to load, process, and manipulate datasets from various sources, ensuring they are ready for use in machine learning models or data analysis tasks. The subpackage accommodates a variety of data formats and includes functionality for masking, sampling, and managing multiple datasets, making it versatile across the different phases of data-driven projects.
This subpackage is composed of the following modules:
loading_context: Manages the strategy for loading data, allowing flexibility in the source file format.
loading_strategies: Implements specific strategies for different file formats.
manager: Coordinates access and manipulations across multiple datasets used in ML pipelines.
masked: Supports multiple operations on the datasets, such as sampling, refining, and cloning.
loading_context module
This module provides a flexible framework for loading datasets from various file formats by utilizing the strategy design pattern.
It supports dynamic selection of data loading strategies based on the file extension, enabling easy extension and maintenance.
It includes the DataLoadingContext class, responsible for selecting and setting the right loading strategy based on the loaded file's extension.
- class MED3pa.datasets.loading_context.DataLoadingContext(file_path: str)[source]
Bases:
object
A context class for managing data loading strategies. It supports setting and getting the current data loading strategy, as well as loading data as a NumPy array from a specified file.
- get_strategy() DataLoadingStrategy [source]
Returns the currently selected data loading strategy.
- Returns:
The currently selected data loading strategy.
- Return type:
DataLoadingStrategy
- load_as_np(file_path: str, target_column_name: str) Tuple[List[str], ndarray, ndarray] [source]
Loads data from the given file path and returns it as a NumPy array, along with column labels and the target data.
- Parameters:
file_path (str) – The path to the dataset file.
target_column_name (str) – The name of the target column, such as true labels or values in case of regression.
- Returns:
A tuple containing the column labels, observations as a NumPy array, and the target as a NumPy array.
- Return type:
Tuple[List[str], np.ndarray, np.ndarray]
- set_strategy(strategy: DataLoadingStrategy) None [source]
Sets a new data loading strategy.
- Parameters:
strategy (DataLoadingStrategy) – The new data loading strategy to be used.
- strategies = {'csv': <class 'MED3pa.datasets.loading_strategies.CSVDataLoadingStrategy'>}
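To illustrate how the extension-based strategy lookup works, here is a minimal self-contained sketch of the pattern. The class and attribute names are simplified stand-ins chosen for this sketch, not the MED3pa classes themselves:

```python
import os

class CSVStrategy:
    """Stand-in for a concrete loading strategy keyed to the 'csv' extension."""
    name = "csv"

class LoadingContext:
    # Mirrors the `strategies` mapping: file extension -> strategy class.
    strategies = {"csv": CSVStrategy}

    def __init__(self, file_path: str):
        # Derive the extension from the path and look up the matching strategy.
        ext = os.path.splitext(file_path)[1].lstrip(".").lower()
        strategy_cls = self.strategies.get(ext)
        if strategy_cls is None:
            raise ValueError(f"Unsupported file extension: {ext!r}")
        self._strategy = strategy_cls()

    def get_strategy(self):
        return self._strategy

ctx = LoadingContext("data/patients.csv")
print(type(ctx.get_strategy()).__name__)  # CSVStrategy
```

An unsupported extension fails at construction time, which keeps the error close to the point where the wrong file is supplied.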
loading_strategies module
This module provides strategies for loading data from files into usable Python formats, focusing on converting data into NumPy arrays.
It includes an abstract base class, DataLoadingStrategy, that defines a common interface, along with concrete implementations such as CSVDataLoadingStrategy for handling CSV files.
This setup allows easy extension to support additional file types as needed.
- class MED3pa.datasets.loading_strategies.CSVDataLoadingStrategy[source]
Bases:
DataLoadingStrategy
Strategy class for loading CSV data. Implements the abstract execute method to handle CSV files.
- static execute(path_to_file: str, target_column_name: str) Tuple[List[str], ndarray, ndarray] [source]
Loads CSV data from the given path, separates observations and target, and converts them to NumPy arrays.
- Parameters:
path_to_file (str) – The path to the CSV file to be loaded.
target_column_name (str) – The name of the target column in the dataset.
- Returns:
Column labels, observations as a NumPy array, and target as a NumPy array.
- Return type:
Tuple[List[str], np.ndarray, np.ndarray]
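The following sketch shows what separating observations from the target column amounts to, using only the standard library. It is a plain-Python stand-in (the real strategy returns NumPy arrays and reads from a file path rather than a string):

```python
import csv
import io

def load_csv(text: str, target_column_name: str):
    """Split CSV text into (feature column labels, observations, target values)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if target_column_name not in header:
        raise ValueError(f"Target column {target_column_name!r} not found")
    t = header.index(target_column_name)
    # Feature labels are all columns except the target.
    columns = header[:t] + header[t + 1:]
    observations, target = [], []
    for row in reader:
        values = [float(v) for v in row]
        target.append(values[t])
        observations.append(values[:t] + values[t + 1:])
    return columns, observations, target

cols, X, y = load_csv("age,bmi,label\n63,27.1,1\n41,22.4,0\n", "label")
# cols == ['age', 'bmi'], y == [1.0, 0.0]
```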
- class MED3pa.datasets.loading_strategies.DataLoadingStrategy[source]
Bases:
ABC
Abstract base class for data loading strategies. Defines a common interface for all data loading strategies.
- abstract execute(path_to_file: str, target_column_name: str) Tuple[List[str], ndarray, ndarray] [source]
Abstract method to execute the data loading strategy.
- Parameters:
path_to_file (str) – The path to the file to be loaded.
target_column_name (str) – The name of the target column in the dataset.
- Returns:
A tuple containing the column labels, observations as a NumPy array, and the target as a NumPy array.
- Return type:
Tuple[List[str], np.ndarray, np.ndarray]
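Because all strategies share this interface, supporting a new file type means subclassing the base class and registering the subclass under its extension. The sketch below uses a hypothetical JSON strategy (JSON support is not claimed to exist in MED3pa; the class is shown only to illustrate the extension mechanism):

```python
from abc import ABC, abstractmethod
import json

class LoadingStrategy(ABC):
    """Simplified stand-in for the DataLoadingStrategy interface."""
    @abstractmethod
    def execute(self, path_to_file: str, target_column_name: str):
        ...

class JSONLoadingStrategy(LoadingStrategy):
    """Hypothetical new format, illustrating how the interface is extended."""
    def execute(self, path_to_file, target_column_name):
        with open(path_to_file) as f:
            records = json.load(f)
        # Feature labels are every key except the target column.
        columns = [k for k in records[0] if k != target_column_name]
        observations = [[r[c] for c in columns] for r in records]
        target = [r[target_column_name] for r in records]
        return columns, observations, target

# Registering the new strategy alongside the existing 'csv' entry would
# make the context pick it up for .json files.
strategies = {"json": JSONLoadingStrategy}
```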
manager module
The manager.py module manages the different datasets needed for machine learning workflows, particularly for the Detectron and Med3pa methods.
It includes the DatasetsManager class, which holds the training, validation, reference, and testing datasets for a specific ML task.
- class MED3pa.datasets.manager.DatasetsManager[source]
Bases:
object
Manages various datasets for execution of detectron and med3pa methods.
This manager is responsible for loading and holding different sets of data, including the training, validation, reference (domain) and testing (newly encountered) datasets.
- combine(dataset_types: list = None) MaskedDataset [source]
Combines the specified datasets and returns a new MaskedDataset instance.
- Parameters:
dataset_types (list, optional) – List of dataset types to combine. Valid options are ‘training’, ‘validation’, ‘reference’, ‘testing’. If None, combines all datasets that are set.
- Returns:
A new MaskedDataset instance containing the combined data.
- Return type:
MaskedDataset
- Raises:
ValueError – If any specified dataset is not set or if no datasets are provided.
- get_column_labels()[source]
Retrieves the column labels held by the manager.
- Returns:
A list of the column labels extracted from the files.
- Return type:
List[str]
- get_dataset_by_type(dataset_type: str, return_instance: bool = False) MaskedDataset [source]
Helper method to get a dataset by type.
- Parameters:
dataset_type (str) – The type of dataset to retrieve (‘training’, ‘validation’, ‘reference’, ‘testing’).
- Returns:
The corresponding MaskedDataset instance.
- Return type:
MaskedDataset
- Raises:
ValueError – If an invalid dataset_type is provided.
- get_info(show_details: bool = True) dict [source]
Returns information about all the datasets managed by the DatasetsManager.
- Parameters:
show_details (bool) – If True, includes detailed information about each dataset. If False, only indicates whether each dataset is set.
- Returns:
A dictionary containing information about each dataset.
- Return type:
dict
- save_dataset_to_csv(dataset_type: str, file_path: str) None [source]
Saves the specified dataset to a CSV file.
- Parameters:
dataset_type (str) – The type of dataset to save (‘training’, ‘validation’, ‘reference’, ‘testing’).
file_path (str) – The file path to save the dataset to.
- Raises:
ValueError – If an invalid dataset_type is provided.
- set_column_labels(columns: list) None [source]
Sets the column labels for the datasets, excluding the target column.
- Parameters:
columns (list) – The list of columns excluding the target column.
- Raises:
ValueError – If the target column is not found in the list of columns.
- set_from_data(dataset_type: str, observations: ndarray, true_labels: ndarray, column_labels: list = None) None [source]
Sets the specified dataset using numpy arrays for observations and true labels.
- Parameters:
dataset_type (str) – The type of dataset to set (‘training’, ‘validation’, ‘reference’, ‘testing’).
observations (np.ndarray) – The feature vectors of the dataset.
true_labels (np.ndarray) – The true labels of the dataset.
column_labels (list, optional) – The list of column labels for the dataset. Defaults to None.
- Raises:
ValueError – If an invalid dataset_type is provided or if column labels do not match existing column labels.
ValueError – If column_labels is not provided and the manager's column labels have not yet been set.
- set_from_file(dataset_type: str, file: str, target_column_name: str) None [source]
Loads and sets the specified dataset from a file.
- Parameters:
dataset_type (str) – The type of dataset to set (‘training’, ‘validation’, ‘reference’, ‘testing’).
file (str) – The file path to the data.
target_column_name (str) – The name of the target column in the dataset.
- Raises:
ValueError – If an invalid dataset_type is provided or if the shape of observations does not match column labels.
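Every manager method above validates dataset_type against the same four names before touching any data. The bookkeeping can be sketched as follows; this is a simplified stand-in for illustration, not the MED3pa implementation:

```python
class DatasetsManagerSketch:
    """Minimal stand-in showing the dataset_type bookkeeping shared
    by the manager's methods."""
    VALID_TYPES = ("training", "validation", "reference", "testing")

    def __init__(self):
        self._datasets = {}

    def set_from_data(self, dataset_type, observations, true_labels):
        # Reject unknown dataset types up front, as the real methods do.
        if dataset_type not in self.VALID_TYPES:
            raise ValueError(f"Invalid dataset type: {dataset_type!r}")
        if len(observations) != len(true_labels):
            raise ValueError("observations and true_labels length mismatch")
        self._datasets[dataset_type] = (observations, true_labels)

    def get_dataset_by_type(self, dataset_type):
        if dataset_type not in self.VALID_TYPES:
            raise ValueError(f"Invalid dataset type: {dataset_type!r}")
        return self._datasets.get(dataset_type)

m = DatasetsManagerSketch()
m.set_from_data("training", [[1.0], [2.0]], [0, 1])
```

Centralizing the valid-type check keeps the ValueError behavior consistent across set_from_data, set_from_file, get_dataset_by_type, and save_dataset_to_csv.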
masked module
The masked.py module includes the MaskedDataset class, which handles many dataset-related operations, such as cloning, sampling, and refining.
- class MED3pa.datasets.masked.MaskedDataset(observations: ndarray, true_labels: ndarray, column_labels: list = None)[source]
Bases:
Dataset
A dataset wrapper for PyTorch that supports masking and sampling of data points.
- clone() MaskedDataset [source]
Creates a clone of the current MaskedDataset instance.
- Returns:
A new instance of MaskedDataset containing the same data and configurations as the current instance.
- Return type:
MaskedDataset
- combine(other: MaskedDataset) MaskedDataset [source]
Combines the current MaskedDataset with another MaskedDataset.
- Parameters:
other (MaskedDataset) – The other MaskedDataset to combine with.
- Returns:
A new instance of MaskedDataset containing the combined data.
- Return type:
MaskedDataset
- Raises:
ValueError – If the column labels of the two datasets do not match.
- get_confidence_scores() ndarray [source]
Gets the confidence scores of the dataset.
- Returns:
The confidence scores of the dataset.
- Return type:
np.ndarray
- get_file_path() str [source]
Gets the file path of the dataset if it has been set from a file.
- Returns:
The file path of the dataset.
- Return type:
str
- get_info() dict [source]
Returns information about the MaskedDataset.
- Returns:
A dictionary containing dataset information.
- Return type:
dict
- get_observations() ndarray [source]
Gets the observations vectors of the dataset.
- Returns:
The observations vectors of the dataset.
- Return type:
np.ndarray
- get_pseudo_labels() ndarray [source]
Gets the pseudo labels of the dataset.
- Returns:
The pseudo labels of the dataset.
- Return type:
np.ndarray
- get_pseudo_probabilities() ndarray [source]
Gets the pseudo probabilities of the dataset.
- Returns:
The pseudo probabilities of the dataset.
- Return type:
np.ndarray
- get_sample_counts() ndarray [source]
Gets how many times each element of the dataset has been sampled.
- Returns:
The sample counts of the dataset.
- Return type:
np.ndarray
- get_true_labels() ndarray [source]
Gets the true labels of the dataset.
- Returns:
The true labels of the dataset.
- Return type:
np.ndarray
- refine(mask: ndarray) int [source]
Refines the dataset by applying a mask to select specific data points.
- Parameters:
mask (np.ndarray) – A boolean array indicating which data points to keep.
- Returns:
The number of data points remaining after applying the mask.
- Return type:
int
- Raises:
ValueError – If the length of the mask doesn’t match the number of data points.
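The refine contract (apply a boolean mask, mutate the dataset in place, return the remaining count, reject mismatched masks) can be sketched with plain lists. The helper below is illustrative only, not the MaskedDataset code:

```python
def refine(data, labels, mask):
    """Keep only the points where mask is True; return the remaining count."""
    if len(mask) != len(data):
        raise ValueError("Mask length must match the number of data points")
    kept = [(x, y) for x, y, keep in zip(data, labels, mask) if keep]
    # Slice assignment mutates the caller's lists in place, mirroring
    # how refine narrows the dataset rather than returning a copy.
    data[:] = [x for x, _ in kept]
    labels[:] = [y for _, y in kept]
    return len(kept)

X = [[1.0], [2.0], [3.0]]
y = [0, 1, 0]
n = refine(X, y, [True, False, True])  # n == 2
```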
- sample_random(N: int, seed: int) MaskedDataset [source]
Samples N data points randomly from the dataset using the given seed.
- Parameters:
N (int) – The number of samples to return.
seed (int) – The seed for random number generator.
- Returns:
A new instance of the dataset containing N random samples.
- Return type:
MaskedDataset
- Raises:
ValueError – If N is greater than the current number of data points in the dataset.
- sample_uniform(N: int, seed: int) MaskedDataset [source]
Samples N data points from the dataset, prioritizing the least sampled points.
- Parameters:
N (int) – The number of samples to return.
seed (int) – The seed for random number generator.
- Returns:
A new instance of the dataset containing N samples, prioritizing the least sampled points.
- Return type:
MaskedDataset
- Raises:
ValueError – If N is greater than the current number of data points in the dataset.
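One way to realize "prioritizing the least sampled points" is to order indices by their sample count (with a seeded random tie-break) and take the first N, incrementing the counts of whatever was chosen. This is a sketch of the idea, not the MED3pa implementation:

```python
import random

def sample_uniform(sample_counts, N, seed):
    """Pick N indices, preferring those sampled fewest times so far.
    sample_counts is updated in place for the chosen indices."""
    if N > len(sample_counts):
        raise ValueError("N exceeds the number of data points")
    rng = random.Random(seed)
    # Sort by (count, random tie-break) so the least-sampled come first.
    order = sorted(range(len(sample_counts)),
                   key=lambda i: (sample_counts[i], rng.random()))
    chosen = order[:N]
    for i in chosen:
        sample_counts[i] += 1
    return chosen

counts = [0, 2, 0, 1]
picked = sample_uniform(counts, 2, seed=42)  # the two zero-count indices
```

Tracking counts this way is what makes repeated calls spread samples evenly across the dataset, which is exactly what get_sample_counts exposes.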
- save_to_csv(file_path: str) None [source]
Saves the dataset to a CSV file.
- Parameters:
file_path (str) – The file path to save the dataset to.
- set_confidence_scores(confidence_scores: ndarray) None [source]
Sets the confidence scores for the dataset.
- Parameters:
confidence_scores (np.ndarray) – The confidence scores array to be set.
- Raises:
ValueError – If the shape of confidence_scores does not match the number of samples in the observations array.
- set_file_path(file: str) None [source]
Sets the file path of the dataset, used when the dataset has been loaded from a file.
- Parameters:
file (str) – The file path of the dataset.
- set_pseudo_labels(pseudo_labels: ndarray) None [source]
Adds pseudo labels to the dataset.
- Parameters:
pseudo_labels (np.ndarray) – The pseudo labels to add.
- Raises:
ValueError – If the length of pseudo_labels does not match the number of samples.
- set_pseudo_probs_labels(pseudo_probabilities: ndarray, threshold=0.5) None [source]
Sets the pseudo probabilities and corresponding pseudo labels for the dataset. The labels are derived by applying a threshold to the probabilities.
- Parameters:
pseudo_probabilities (np.ndarray) – The pseudo probabilities array to be set.
threshold (float, optional) – The threshold to convert probabilities to binary labels. Defaults to 0.5.
- Raises:
ValueError – If the shape of pseudo_probabilities does not match the number of samples in the observations array.
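Deriving labels from probabilities is a simple thresholding step. The sketch below assumes probabilities at exactly the threshold map to 1; whether the real implementation uses >= or > at the boundary is not specified here:

```python
def pseudo_labels_from_probs(pseudo_probabilities, threshold=0.5):
    """Derive binary pseudo labels by thresholding probabilities
    (illustrative sketch; boundary behavior at p == threshold is assumed)."""
    return [1 if p >= threshold else 0 for p in pseudo_probabilities]

labels = pseudo_labels_from_probs([0.91, 0.42, 0.5], threshold=0.5)
# labels == [1, 0, 1] under the >= convention
```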