Multiview dataset management¶
-
class
Dataset
¶ This is the base class for all types of multiview datasets in SuMMIT.
-
get_shape
(view_index=0, sample_indices=None)¶ Gets the shape of the requested view, restricted to the requested samples.
Parameters: - view_index (int) – The index of the view to extract
- sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The shape of the view for the requested samples.
Return type: tuple
-
init_sample_indices
(sample_indices=None)¶ If no sample indices are provided, selects all the available samples.
Parameters: sample_indices (np.array) – An array-like containing the indices of the samples.
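The defaulting behaviour described above can be sketched as follows. This is a minimal stand-alone illustration, not SuMMIT's implementation: the function takes a hypothetical `nb_samples` argument in place of the dataset object, and the conversion of array-likes with `np.asarray` is an assumption.

```python
import numpy as np


def init_sample_indices(nb_samples, sample_indices=None):
    """Hypothetical stand-alone helper: default to every sample index."""
    if sample_indices is None:
        # No selection given: use all samples, 0 .. nb_samples - 1.
        return np.arange(nb_samples)
    # Otherwise normalise any array-like to a numpy array (assumption).
    return np.asarray(sample_indices)


all_indices = init_sample_indices(5)           # every sample
some_indices = init_sample_indices(5, [0, 3])  # explicit selection
```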
-
to_numpy_array
(sample_indices=None, view_indices=None)¶ Concatenates the needed views in one big numpy array while saving the limits of each view in a list, to be able to retrieve them later.
Parameters: - sample_indices (array like) – The indices of the samples to extract from the dataset
- view_indices (array like) – The indices of the views to concatenate in the numpy array
Returns: - concat_views (numpy array) – The numpy array containing all the needed views.
- view_limits (list of int) – The limits of each slice used to extract the views.
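To illustrate how the view limits can be used to recover a single view from the concatenated array, here is a rough sketch in plain numpy. The helper `concatenate_views` is hypothetical and works on a list of arrays rather than on a SuMMIT dataset; the exact layout of `view_limits` (cumulative column offsets starting at 0) is an assumption, not something the documentation confirms.

```python
import numpy as np


def concatenate_views(views, sample_indices=None, view_indices=None):
    """Hypothetical helper mirroring the documented behaviour."""
    if view_indices is None:
        view_indices = range(len(views))
    if sample_indices is None:
        sample_indices = np.arange(views[0].shape[0])
    # Keep only the requested rows of each requested view.
    selected = [views[i][sample_indices] for i in view_indices]
    # Assumed layout: view k occupies columns limits[k]:limits[k + 1].
    view_limits = [0]
    for view in selected:
        view_limits.append(view_limits[-1] + view.shape[1])
    concat_views = np.concatenate(selected, axis=1)
    return concat_views, view_limits


views = [np.zeros((4, 3)), np.ones((4, 2))]
concat_views, view_limits = concatenate_views(views, sample_indices=[0, 2])
# Slice the second view back out of the concatenated array.
second_view = concat_views[:, view_limits[1]:view_limits[2]]
```

Under this layout, the list holds one more entry than there are views, so consecutive pairs of limits delimit each view's columns.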
-
-
class
HDF5Dataset
(views=None, labels=None, are_sparse=False, file_name='dataset.hdf5', view_names=None, path='', hdf5_file=None, labels_names=None, is_temp=False, sample_ids=None)¶ Dataset class
This is used to encapsulate the multiview dataset while keeping it stored on the disk instead of in RAM.
Parameters: - views (list of numpy arrays or None) – The list containing each view of the dataset as a numpy array of shape (nb samples, nb features).
- labels (numpy array or None) – The labels for the multiview dataset, of shape (nb samples, ).
- are_sparse (list of bool, or None) – The list of booleans indicating whether each view is sparse.
- file_name (str, or None) – The name of the hdf5 file that will be created to store the multiview dataset.
- view_names (list of str, or None) – The name of each view.
- path (str, or None) – The path where the hdf5 dataset file will be stored
- hdf5_file (h5py.File object, or None) – If not None, the dataset will be imported directly from this file.
- labels_names (list of str, or None) – The name for each unique value of the labels given in labels.
- is_temp (bool) – Used if a temporary dataset has to be stored by the benchmark.
-
dataset
¶ The h5py file object that points to the hdf5 dataset on the disk.
Type: h5py.File object
-
nb_view
¶ The number of views in the dataset.
Type: int
-
view_dict
¶ The dictionary with the name of each view as keys and their indices as values.
Type: dict
-
get_label_names
(decode=True, sample_indices=None)¶ Used to get the list of the label names for the given set of samples
Parameters: - decode (bool) – If True, will decode the label names before listing them
- sample_indices (numpy.ndarray) – The array containing the indices of the needed samples
Returns: The selected labels’ names.
Return type: list
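The role of `decode` can be illustrated with a small sketch. It assumes that label names are stored as byte strings (as is common for HDF5 attributes) and that only the names of labels present in the selected samples are returned; both points are assumptions, and `label_names_for` is a hypothetical helper, not part of SuMMIT.

```python
import numpy as np


def label_names_for(labels, labels_names, sample_indices=None, decode=True):
    """Hypothetical helper: names of the labels found in the selection."""
    labels = np.asarray(labels)
    if sample_indices is None:
        sample_indices = np.arange(len(labels))
    # Distinct label values among the selected samples, in sorted order.
    selected = np.unique(labels[sample_indices])
    names = [labels_names[label] for label in selected]
    if decode:
        # Turn byte strings into regular Python strings.
        names = [n.decode("utf-8") if isinstance(n, bytes) else n
                 for n in names]
    return names


labels = np.array([0, 1, 2, 1, 0])
labels_names = [b"cat", b"dog", b"bird"]
decoded = label_names_for(labels, labels_names, sample_indices=[0, 1, 3])
raw = label_names_for(labels, labels_names, sample_indices=[0, 1, 3],
                      decode=False)
```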
-
get_labels
(sample_indices=None)¶ Gets the label array for the asked samples
Parameters: sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The labels of the requested samples.
Return type: numpy.ndarray
-
get_name
()¶ Gets the name of the dataset hdf5 file
-
get_nb_class
(sample_indices=None)¶ Gets the number of classes of the dataset for the asked samples
Parameters: sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The number of classes
Return type: int
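Because the count is taken over the requested samples only, a subset can contain fewer classes than the full dataset. A minimal sketch of that idea, using a hypothetical `count_classes` helper on a plain label array (an assumption about how the count is derived, not SuMMIT's code):

```python
import numpy as np


def count_classes(labels, sample_indices=None):
    """Hypothetical helper: number of distinct labels in the selection."""
    labels = np.asarray(labels)
    if sample_indices is not None:
        labels = labels[sample_indices]
    return len(np.unique(labels))


labels = np.array([0, 1, 2, 1, 0])
nb_all = count_classes(labels)                # all samples
nb_subset = count_classes(labels, [0, 1, 3])  # class 2 is absent here
```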
-
get_nb_samples
()¶ Used to get the number of samples available in the dataset
Returns: The number of samples.
Return type: int
-
get_shape
(view_index=0, sample_indices=None)¶ Gets the shape of the requested view, restricted to the requested samples.
Parameters: - view_index (int) – The index of the view to extract
- sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The shape of the view for the requested samples.
Return type: tuple
-
get_v
(view_index, sample_indices=None)¶ Extracts the view and returns a numpy.ndarray containing the description of the samples specified in sample_indices
Parameters: - view_index (int) – The index of the view to extract
- sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The view data for the requested samples.
Return type: numpy.ndarray
-
get_view_dict
()¶ Returns the dictionary containing view indices as keys and their corresponding names as values
-
get_view_name
(view_idx)¶ Method to get a view’s name from its index.
Parameters: view_idx (int) – The index of the view in the dataset
Returns: The view’s name.
-
init_attrs
()¶ Used to init the attributes that are modified when self.dataset changes
-
init_sample_indices
(sample_indices=None)¶ If no sample indices are provided, selects all the available samples.
Parameters: sample_indices (np.array) – An array-like containing the indices of the samples.
-
rm
()¶ Method used to delete the dataset file on the disk if the dataset is temporary.
-
to_numpy_array
(sample_indices=None, view_indices=None)¶ Concatenates the needed views in one big numpy array while saving the limits of each view in a list, to be able to retrieve them later.
Parameters: - sample_indices (array like) – The indices of the samples to extract from the dataset
- view_indices (array like) – The indices of the views to concatenate in the numpy array
Returns: - concat_views (numpy array) – The numpy array containing all the needed views.
- view_limits (list of int) – The limits of each slice used to extract the views.
-
class
RAMDataset
(views=None, labels=None, are_sparse=False, view_names=None, labels_names=None, sample_ids=None, name=None)¶
-
get_shape
(view_index=0, sample_indices=None)¶ Gets the shape of the requested view, restricted to the requested samples.
Parameters: - view_index (int) – The index of the view to extract
- sample_indices (numpy.ndarray) – The array containing the indices of the samples to extract.
Returns: The shape of the view for the requested samples.
Return type: tuple
-
init_sample_indices
(sample_indices=None)¶ If no sample indices are provided, selects all the available samples.
Parameters: sample_indices (np.array) – An array-like containing the indices of the samples.
-
to_numpy_array
(sample_indices=None, view_indices=None)¶ Concatenates the needed views in one big numpy array while saving the limits of each view in a list, to be able to retrieve them later.
Parameters: - sample_indices (array like) – The indices of the samples to extract from the dataset
- view_indices (array like) – The indices of the views to concatenate in the numpy array
Returns: - concat_views (numpy array) – The numpy array containing all the needed views.
- view_limits (list of int) – The limits of each slice used to extract the views.
-
-
confirm
(resp=True, timeout=15)¶ Used to process answer
-
copy_hdf5
(pathF, name, nbCores)¶ Used to copy an HDF5 database in case of multicore computing
-
datasets_already_exist
(pathF, name, nbCores)¶ Used to check if it’s necessary to copy datasets
-
delete_HDF5
(benchmarkArgumentsDictionaries, nbCores, dataset)¶ Used to delete temporary copies at the end of the benchmark
-
extract_subset
(matrix, used_indices)¶ Used to extract a subset of a matrix, even if it is sparse (work in progress).
-
get_samples_views_indices
(dataset, samples_indices, view_indices)¶ Used to get all the sample indices and all the view indices when they are not provided.
-
init_multiple_datasets
(path_f, name, nb_cores)¶ Used to create copies of the dataset if multicore computation is used.
This is a temporary solution to fix the sharing memory issue with HDF5 datasets.
Parameters: - path_f (string) – Path to the original dataset directory
- name (string) – Name of the dataset
- nb_cores (int) – The number of threads that the benchmark can use
Returns: datasetFiles – Dictionary summarizing which mono- and multiview algorithms will be used in the benchmark.
Return type: None
-
input_
(timeout=15)¶ Used as a UI to stop if too much HDD space will be used