---
title: MovieLens Dataset
keywords: fastai
sidebar: home_sidebar
summary: "Implementation of MovieLens datasets."
description: "Implementation of MovieLens datasets."
nb_path: "nbs/datasets/datasets.movielens.ipynb"
---
{% raw %}
{% endraw %} {% raw %}
{% endraw %}

ML1m Rating Dataset

{% raw %}

class ML1mDataset[source]

ML1mDataset(*args, **kwds) :: InteractionsDataset

An abstract class representing a Dataset.

All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__, supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__, which is expected to return the size of the dataset by many torch.utils.data.Sampler implementations and the default options of torch.utils.data.DataLoader.

Note: torch.utils.data.DataLoader by default constructs an index sampler that yields integral indices. To make it work with a map-style dataset with non-integral indices/keys, a custom sampler must be provided.
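
To make the map-style contract concrete, here is a minimal sketch of a custom map-style dataset (the class and data below are illustrative, not part of this library): a subclass supplies __getitem__ and, usually, __len__.

import torch
from torch.utils.data import Dataset, DataLoader

class ToyInteractions(Dataset):
    """Toy map-style dataset: each sample is a (user, item, label) triple."""
    def __init__(self):
        self.rows = torch.tensor([[0, 10, 1], [0, 11, 0], [1, 10, 1]])

    def __len__(self):
        # size of the dataset, used by samplers and DataLoader defaults
        return self.rows.size(0)

    def __getitem__(self, idx):
        # fetch one sample for an integral index
        return self.rows[idx]

loader = DataLoader(ToyInteractions(), batch_size=2, shuffle=True)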

{% endraw %} {% raw %}
{% endraw %} {% raw %}

class ML1mDataModule[source]

ML1mDataModule(*args:Any, **kwargs:Any) :: InteractionsDataModule

A DataModule standardizes the training, val, test splits, data preparation and transforms. The main advantage is consistent data splits, data preparation and transforms across models.

Example:

from pytorch_lightning import LightningDataModule
from torch.utils.data import DataLoader, Dataset

class MyDataModule(LightningDataModule):
    def __init__(self):
        super().__init__()
    def prepare_data(self):
        # download, split, etc...
        # only called on 1 GPU/TPU in distributed
        ...
    def setup(self, stage):
        # make assignments here (val/train/test split)
        # called on every process in DDP
        ...
    def train_dataloader(self):
        train_split = Dataset(...)
        return DataLoader(train_split)
    def val_dataloader(self):
        val_split = Dataset(...)
        return DataLoader(val_split)
    def test_dataloader(self):
        test_split = Dataset(...)
        return DataLoader(test_split)
    def teardown(self, stage):
        # clean up after fit or test
        # called on every process in DDP
        ...

A DataModule implements 6 key methods:

  • prepare_data (things to do on 1 GPU/TPU, not on every GPU/TPU in distributed mode).
  • setup (things to do on every accelerator in distributed mode).
  • train_dataloader (the training dataloader).
  • val_dataloader (the val dataloader(s)).
  • test_dataloader (the test dataloader(s)).
  • teardown (things to do on every accelerator in distributed mode when finished).

This allows you to share a full dataset without explaining how to download, split, transform, and process the data.

{% endraw %} {% raw %}
{% endraw %} {% raw %}
class Args:
    def __init__(self):
        self.data_dir = '/content/data'     # where raw and processed files are stored
        self.min_rating = 4                 # ratings below this are dropped when converting to implicit feedback
        self.num_negative_samples = 99      # negatives sampled per user for evaluation
        self.min_uc = 5                     # keep users with at least this many interactions
        self.min_sc = 5                     # keep items with at least this many interactions
        self.val_p = 0.2                    # validation split proportion
        self.test_p = 0.2                   # test split proportion
        self.seed = 42                      # random seed for reproducible splits
        self.split_type = 'stratified'      # train/val/test splitting strategy

args = Args()
{% endraw %} {% raw %}
ds = ML1mDataModule(**args.__dict__)
ds.prepare_data()
Processing...
Turning into implicit ratings
Filtering triplets
Densifying index
Done!
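
With the data prepared, the datamodule can be used like any LightningDataModule. The following is a usage sketch, assuming ML1mDataModule exposes the standard setup/train_dataloader hooks described above; the exact batch contents depend on the underlying InteractionsDataset.

ds.setup('fit')                       # build the train/val assignments
train_loader = ds.train_dataloader()  # standard LightningDataModule hook
batch = next(iter(train_loader))      # pull one training batch to inspect
print(type(batch))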
{% endraw %} {% raw %}
!tree -h --du -C "{args.data_dir}"
/content/data
├── [ 11M]  processed
│   ├── [2.3M]  data_test_neg.pt
│   ├── [ 95K]  data_test_pos.pt
│   ├── [6.5M]  data_train.pt
│   ├── [2.3M]  data_valid_neg.pt
│   └── [ 95K]  data_valid_pos.pt
└── [ 23M]  raw
    └── [ 23M]  ratings.dat

  35M used in 2 directories, 6 files
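
The .pt files listed above can be inspected directly with torch.load; a quick sketch (the exact structure of each file depends on the processing step, so only the type is printed here):

import torch

# Peek at one of the processed artifacts; its structure
# (tensor, dict, list, ...) depends on how the dataset serialized it.
train_data = torch.load('/content/data/processed/data_train.pt')
print(type(train_data))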
{% endraw %}

ML100k Dataset

{% raw %}

class ML100kDataset[source]

ML100kDataset(root) :: Dataset

Dataset base class
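
A minimal usage sketch, assuming root is the directory that holds (or will receive) the ML-100k files, mirroring the ML-1M layout above:

# root is assumed to point at the ML-100k data directory (hypothetical path)
ds_100k = ML100kDataset(root='/content/data100k')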

{% endraw %} {% raw %}
{% endraw %}