---
title: Interactions Dataset
keywords: fastai
sidebar: home_sidebar
summary: "Implementation of base modules for interactions dataset."
description: "Implementation of base modules for interactions dataset."
nb_path: "nbs/datasets/bases/datasets.bases.interactions.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %}

InteractionsDataset

{% raw %}

class InteractionsDataset[source]

InteractionsDataset(*args, **kwds) :: Dataset

An abstract class representing a `Dataset`.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite `__getitem__`, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite `__len__`, which is expected to return the size of the dataset by many `torch.utils.data.Sampler` implementations and the default options of `torch.utils.data.DataLoader`.

Note: `torch.utils.data.DataLoader` by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.

{% endraw %} {% raw %}
{% endraw %}
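The note above is relevant for interaction data keyed by user or item ids rather than positions. As a minimal sketch of that pattern (the dataset and sampler below are illustrative, not part of this library), a map-style dataset with string keys can be paired with a custom sampler that yields those keys:

{% raw %}
from torch.utils.data import Dataset, DataLoader, Sampler

class StringKeyedDataset(Dataset):
    # map-style dataset keyed by user ids (strings), not integers
    def __init__(self, ratings):
        self.ratings = ratings          # e.g. {'u1': [4.0, 3.5], 'u2': [2.0]}
    def __getitem__(self, key):
        return self.ratings[key]
    def __len__(self):
        return len(self.ratings)

class KeySampler(Sampler):
    # yields the dataset's own keys so DataLoader can index by them
    def __init__(self, keys):
        self.keys = list(keys)
    def __iter__(self):
        return iter(self.keys)
    def __len__(self):
        return len(self.keys)

ratings = {'u1': [4.0, 3.5], 'u2': [2.0]}
ds = StringKeyedDataset(ratings)
# batch_size=None disables automatic batching; each sample is returned as-is
loader = DataLoader(ds, sampler=KeySampler(ratings), batch_size=None)
{% endraw %}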

Example:

{% raw %}
import os
import os.path as osp
from shutil import move, rmtree

import pandas as pd

class ML1mDataset(InteractionsDataset):
    url = "http://files.grouplens.org/datasets/movielens/ml-1m.zip"

    @property
    def raw_file_names(self):
        return 'ratings.dat'

    def download(self):
        # download_url and extract_zip are utility helpers provided by the library
        path = download_url(self.url, self.raw_dir)
        extract_zip(path, self.raw_dir)
        # flatten the extracted ml-1m/ folder into raw_dir and clean up
        move(osp.join(self.raw_dir, 'ml-1m', self.raw_file_names), self.raw_dir)
        rmtree(osp.join(self.raw_dir, 'ml-1m'))
        os.unlink(path)

    def load_ratings_df(self):
        df = pd.read_csv(self.raw_paths[0], sep='::', header=None, engine='python')
        df.columns = ['uid', 'sid', 'rating', 'timestamp']
        # drop duplicate user-item pair records, keeping recent ratings only
        df.drop_duplicates(subset=['uid', 'sid'], keep='last', inplace=True)
        return df
{% endraw %}
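To use such a dataset, one would instantiate it and read the ratings frame. A hedged usage sketch follows; the root-directory constructor argument is an assumption inferred from the `raw_dir`/`raw_paths` attributes used above, not a documented signature:

{% raw %}
# the root-directory argument is assumed, not documented
ds = ML1mDataset('/tmp/ml-1m')      # triggers download() on first use
df = ds.load_ratings_df()
print(df.head())                    # columns: uid, sid, rating, timestamp
{% endraw %}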

InteractionsDataModule

{% raw %}

class InteractionsDataModule[source]

InteractionsDataModule(*args:Any, **kwargs:Any) :: LightningDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Example:

class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
    def prepare_data(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        ...
    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        ...
    def train_dataloader(self):
        train_split = Dataset(...)
        return DataLoader(train_split)
    def val_dataloader(self):
        val_split = Dataset(...)
        return DataLoader(val_split)
    def test_dataloader(self):
        test_split = Dataset(...)
        return DataLoader(test_split)
    def teardown(self):
        # clean up after fit or test
        # called on every process in DDP
        ...

A DataModule implements 6 key methods:

  • prepare_data (things to do on 1 GPU/TPU not on every GPU/TPU in distributed mode).
  • setup (things to do on every accelerator in distributed mode).
  • train_dataloader the training dataloader.
  • val_dataloader the val dataloader(s).
  • test_dataloader the test dataloader(s).
  • teardown (things to do on every accelerator in distributed mode when finished).

This allows you to share a full dataset without explaining how to download, split, transform, and process the data.

{% endraw %} {% raw %}
{% endraw %}

Example:

{% raw %}
class ML1mDataModule(InteractionsDataModule):
    dataset_cls = ML1mDataset
{% endraw %}
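A DataModule like this is typically passed straight to a PyTorch Lightning `Trainer`, which calls `prepare_data`, `setup`, and the dataloader hooks at the right points. In the sketch below, the `data_dir` constructor argument and the `MyRecommender` model are hypothetical placeholders, not part of this library:

{% raw %}
import pytorch_lightning as pl

dm = ML1mDataModule(data_dir='/tmp/ml-1m')   # `data_dir` is an assumed argument
model = MyRecommender()                      # any LightningModule
trainer = pl.Trainer(max_epochs=5)
trainer.fit(model, datamodule=dm)            # runs prepare_data/setup/dataloaders
trainer.test(model, datamodule=dm)
{% endraw %}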