---
title: Data
keywords: fastai
sidebar: home_sidebar
summary: "Tools to help construct datasets, which may be related to loading, processing, or encoding data."
description: "Tools to help construct datasets, which may be related to loading, processing, or encoding data."
---
{% raw %}
{% endraw %} {% raw %}
%load_ext autoreload
%autoreload 2
%matplotlib inline
{% endraw %} {% raw %}
{% endraw %} {% raw %}
# Only needed for testing.
from collections import Counter
from itertools import chain
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

from htools import eprint
{% endraw %} {% raw %}
{% endraw %} {% raw %}

probabilistic_hash_item[source]

probabilistic_hash_item(x, n_buckets, mode='int', n_hashes=3)

Slightly hacky way to probabilistically hash an integer by
first converting it to a string.

Parameters
----------
x: int or str
    The integer or string to hash.
n_buckets: int
    The number of buckets that items will be mapped to. Typically
    this step would occur outside the hashing function, but the
    intended use case here is narrow enough that it makes sense
    to handle it internally.
mode: type
    The type of input you want to hash. This is user-provided to prevent
    accidents where we pass in a different item than intended and hash
    the wrong thing. One of (int, str). When using this inside a
    BloomEmbedding layer, this must be `int` because there are no
    string tensors. When used inside a dataset or as a one-time
    pre-processing step, you can choose either as long as you
    pass in the appropriate inputs.
n_hashes: int
    The number of times to hash x, each time with a different seed.

Returns
-------
list[int]: A list of integers with length `n_hashes`, where each integer
    is in [0, n_buckets).
{% endraw %} {% raw %}
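{% endraw %}

For example (a hypothetical call, following the usage shown further down this page), the same input always maps to the same `n_hashes` bucket indices, each in [0, n_buckets):

{% raw %}
# Hash the integer 11 into 4 of 7 buckets. The exact bucket values depend on
# the hash seeds; the key properties are that the call is deterministic and
# returns `n_hashes` integers in [0, n_buckets).
hashes = probabilistic_hash_item(11, 7, int, 4)
assert hashes == probabilistic_hash_item(11, 7, int, 4)
assert len(hashes) == 4 and all(0 <= h < 7 for h in hashes)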
{% endraw %} {% raw %}

probabilistic_hash_tensor[source]

probabilistic_hash_tensor(x_r2, n_buckets, n_hashes=3, pad_idx=0)

Hash a rank 2 LongTensor.

Parameters
----------
x_r2: torch.LongTensor
    Rank 2 tensor of integers. Shape: (bs, seq_len)
n_buckets: int
    Number of buckets to hash items into (i.e. the number of
    rows in the embedding matrix). Typically a moderately large
    prime number, like 251 or 997.
n_hashes: int
    Number of hashes to take for each input index. This determines
    the number of rows of the embedding matrix that will be summed
    to get the representation for each word. Typically 2-5.
pad_idx: int or None
    If you want to pad sequences with vectors of zeros, pass in an
    integer (same as the `padding_idx` argument to nn.Embedding).
    If None, no padding index will be used. The sequences must be
    padded before passing them into this function.

Returns
-------
torch.LongTensor: Tensor of indices where each row corresponds
    to one of the input indices. Shape: (bs, seq_len, n_hashes)
{% endraw %} {% raw %}
sents = [
    'I walked to the store so I hope it is not closed.',
    'The theater is closed today and the sky is grey.',
    'His dog is brown while hers is grey.'
]
labels = [0, 1, 1]
{% endraw %} {% raw %}
class Data(Dataset):
    
    def __init__(self, sentences, labels, seq_len):
        x = [s.split(' ') for s in sentences]
        self.w2i = self.make_w2i(x)
        self.seq_len = seq_len
        self.x = self.encode(x)
        self.y = torch.tensor(labels)
        
    def __getitem__(self, i):
        return self.x[i], self.y[i]
    
    def __len__(self):
        return len(self.y)
    
    def make_w2i(self, tok_rows):
        # Map each word to an integer index by frequency, starting at 1 so
        # that 0 is reserved for padding and unknown words.
        return {word: i for i, (word, _) in
                enumerate(Counter(chain(*tok_rows)).most_common(), 1)}

    def encode(self, tok_rows):
        # Encode each tokenized sentence as word indices, truncating or
        # zero-padding to a fixed sequence length.
        enc = np.zeros((len(tok_rows), self.seq_len), dtype=int)
        for i, row in enumerate(tok_rows):
            trunc = [self.w2i.get(w, 0) for w in row[:self.seq_len]]
            enc[i, :len(trunc)] = trunc
        return torch.tensor(enc)
{% endraw %}

We construct a toy dataset with a vocabulary of size 23. In reality, you might wish to lowercase text or use a better tokenizer, but this is sufficient for the purposes of demonstration.

{% raw %}
ds = Data(sents, labels, 10)
len(ds.w2i)
23
{% endraw %} {% raw %}
dl = DataLoader(ds, batch_size=3)
x, y = next(iter(dl))
x, y
(tensor([[ 2,  5,  6,  3,  7,  8,  2,  9, 10,  1],
         [13, 14,  1, 15, 16, 17,  3, 18,  1,  4],
         [19, 20,  1, 21, 22, 23,  1,  4,  0,  0]]), tensor([0, 1, 1]))
{% endraw %} {% raw %}
x.shape
torch.Size([3, 10])
{% endraw %}

We hash each word index 4 times, as specified by the `n_hashes` parameter of `probabilistic_hash_tensor`. Notice that we use only 7 buckets, meaning the embedding matrix will have 7 rows rather than 23 (not counting a padding row).

{% raw %}
x_hashed = probabilistic_hash_tensor(x, n_buckets=7, n_hashes=4)
print('x shape:', x.shape)
print('x_hashed shape:', x_hashed.shape)
x shape: torch.Size([3, 10])
x_hashed shape: torch.Size([3, 10, 4])
{% endraw %}

Below, each row of 4 numbers encodes a single word.

{% raw %}
x_hashed
tensor([[[2, 0, 2, 2],
         [1, 6, 2, 4],
         [2, 0, 1, 4],
         [5, 2, 6, 5],
         [1, 5, 1, 3],
         [0, 4, 0, 0],
         [2, 0, 2, 2],
         [2, 4, 0, 2],
         [3, 4, 4, 6],
         [5, 0, 3, 6]],

        [[5, 4, 4, 2],
         [5, 3, 3, 1],
         [5, 0, 3, 6],
         [2, 1, 1, 1],
         [2, 4, 4, 4],
         [2, 4, 2, 6],
         [5, 2, 6, 5],
         [3, 5, 0, 0],
         [5, 0, 3, 6],
         [1, 5, 2, 6]],

        [[5, 5, 3, 4],
         [4, 5, 5, 1],
         [5, 0, 3, 6],
         [6, 2, 0, 6],
         [4, 2, 6, 1],
         [3, 6, 1, 6],
         [5, 0, 3, 6],
         [1, 5, 2, 6],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]])
{% endraw %}

See how each word is mapped to a list of 4 indices.

{% raw %}
for word, i in zip(sents[0].split(' '), x[0]):
    print(word, probabilistic_hash_item(i.item(), 7, int, 4))
I [2, 0, 2, 2]
walked [1, 6, 2, 4]
to [2, 0, 1, 4]
the [5, 2, 6, 5]
store [1, 5, 1, 3]
so [0, 4, 0, 0]
I [2, 0, 2, 2]
hope [2, 4, 0, 2]
it [3, 4, 4, 6]
is [5, 0, 3, 6]
{% endraw %}
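
To see how these hashed indices would feed an embedding layer, here is a minimal sketch using a plain `nn.Embedding` rather than the library's BloomEmbedding layer (the embedding dimension of 16 is an arbitrary choice for illustration): each word's `n_hashes` rows are looked up and summed, so a 7-row matrix can represent all 23 words.

{% raw %}
import torch.nn as nn

# padding_idx=0 matches the pad_idx used by probabilistic_hash_tensor, so
# padded positions sum to zero vectors.
emb = nn.Embedding(7, 16, padding_idx=0)
vecs = emb(x_hashed).sum(dim=-2)    # (3, 10, 4, 16) -> sum over hashes -> (3, 10, 16)
vecs.shape
# torch.Size([3, 10, 16])
{% endraw %}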

Notice that hashing the words directly is also possible, but the resulting hashes will differ from those produced by first encoding the words as integers and then hashing those. This is fine as long as you are consistent.

{% raw %}
for row in [s.split(' ') for s in sents]:
    eprint(list(zip(row, (probabilistic_hash_item(word, 11, str) for word in row))))
    print()
 0: ('I', [0, 5, 5])
 1: ('walked', [2, 5, 1])
 2: ('to', [10, 4, 6])
 3: ('the', [4, 1, 4])
 4: ('store', [4, 6, 3])
 5: ('so', [7, 8, 8])
 6: ('I', [0, 5, 5])
 7: ('hope', [1, 2, 7])
 8: ('it', [3, 9, 0])
 9: ('is', [3, 1, 3])
10: ('not', [6, 10, 4])
11: ('closed.', [3, 6, 10])

 0: ('The', [1, 6, 9])
 1: ('theater', [8, 10, 2])
 2: ('is', [3, 1, 3])
 3: ('closed', [5, 5, 0])
 4: ('today', [3, 10, 8])
 5: ('and', [7, 2, 4])
 6: ('the', [4, 1, 4])
 7: ('sky', [1, 2, 9])
 8: ('is', [3, 1, 3])
 9: ('grey.', [7, 6, 7])

 0: ('His', [0, 10, 3])
 1: ('dog', [8, 6, 6])
 2: ('is', [3, 1, 3])
 3: ('brown', [9, 8, 9])
 4: ('while', [9, 2, 8])
 5: ('hers', [0, 5, 4])
 6: ('is', [3, 1, 3])
 7: ('grey.', [7, 6, 7])

{% endraw %}

Below, we show that we can obtain unique representations for >99.9% of words in a vocabulary of 30,000 words with a far smaller embedding matrix. The number of buckets is the number of rows in the embedding matrix.

{% raw %}
def unique_combos(tups):
    return len(set(tuple(sorted(x)) for x in tups))
{% endraw %} {% raw %}
def hash_all_idx(vocab_size, n_buckets, n_hashes):
    return [probabilistic_hash_item(i, n_buckets, int, n_hashes) 
            for i in range(vocab_size)]
{% endraw %} {% raw %}
vocab_size = 30_000
buckets2hashes = {127: 5,
                  251: 4,
                  997: 3,
                  5_003: 2}
for b, h in buckets2hashes.items():
    tups = hash_all_idx(vocab_size, b, h)
    unique = unique_combos(tups)
    print('\n\nBuckets:', b, '\nHashes:', h, '\nUnique combos:', unique,
          '\n% unique:', round(unique / vocab_size, 4))

Buckets: 127 
Hashes: 5 
Unique combos: 29998 
% unique: 0.9999


Buckets: 251 
Hashes: 4 
Unique combos: 29996 
% unique: 0.9999


Buckets: 997 
Hashes: 3 
Unique combos: 29997 
% unique: 0.9999


Buckets: 5003 
Hashes: 2 
Unique combos: 29969 
% unique: 0.999
{% endraw %}
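
To make the savings concrete (the embedding dimension of 100 is an arbitrary assumption for illustration): 997 buckets shrinks the embedding matrix roughly 30x relative to a full 30,000-row matrix, at the cost of 3 lookups per token.

{% raw %}
vocab_size, n_buckets, emb_dim = 30_000, 997, 100
full_params = vocab_size * emb_dim    # 3,000,000 weights for a standard embedding
bloom_params = n_buckets * emb_dim    # 99,700 weights for the hashed embedding
round(full_params / bloom_params, 1)
# 30.1
{% endraw %}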

Datasets

{% raw %}
@auto_repr
class LazyDataset(Dataset):
    """Lazily load batches from an enormous dataframe that can't fit into 
    memory.
    """

    def __init__(self, df_path, length, shuffle, chunksize=1_000, 
                 c=2, classes=('neg', 'pos'), **kwargs):
        """
        Parameters
        ----------
        df_path: str
            File path of dataframe to load.
        length: int
            Number of rows of data to use. This is required so that we don't
            have to read through the whole file just to count its lines, which
            can be slow for an enormous dataset. It also makes it easy to work
            with a subset (the data should already be shuffled, so choosing
            the top n rows is fine).
        shuffle: bool
            If True, shuffle the data within each chunk. Note that if the
            batch size is close to the chunk size this has minimal effect, so
            the training set should use as large a chunk as possible when
            shuffling. Shuffling is unnecessary for the validation set.
        chunksize: int
            Number of rows of df to load at a time. This should usually 
            be significantly larger than the batch size in order to retain
            some randomness in the batches.
        c: int
            Number of classes. Used if training with FastAI.
        classes: iterable
            List or tuple of class names. Used if training with FastAI.
        kwargs: any
            Additional keyword arguments to pass to `pd.read_csv`, e.g.
            compression='gzip'.
        """
        if length < chunksize:
            warnings.warn('Total # of rows < 1 full chunk. LazyDataset may '
                          'not be necessary.')

        self.length = length
        self.shuffle = shuffle
        self.chunksize = chunksize
        self.df_path = df_path
        self.df = None
        self.chunk = None
        self.chunk_idx = None
        self.df_kwargs = kwargs
        
        # Additional attributes required by FastAI. 
        # c: Number of classes in model.
        self.c = c
        self.classes = list(classes)
        
    def __len__(self):
        return self.length
    
    def __getitem__(self, idx):
        """Because not all indices are loaded at once, we must do shuffling
        in the dataset rather than the dataloader (e.g. if the loader randomly
        samples index 5000 but we have indices 0-500 loaded, it will be
        unavailable).

        Parameters
        ----------
        idx: int
            Retrieve item i in dataset.

        Returns
        -------
        tuple[np.array]: x array, y array
        """
        # Load next chunk of data if necessary. Must specify nrows, otherwise
        # we will chunk through the whole file.
        if not self.chunk_idx:
            while True:
                try:
                    self.chunk = self.df.get_chunk()
                    break
                except (AttributeError, StopIteration):
                    self.df = pd.read_csv(self.df_path, engine='python',
                                          chunksize=self.chunksize,
                                          nrows=len(self),
                                          **self.df_kwargs)

            self.chunk_idx = self.chunk.index.values
            if self.shuffle: np.random.shuffle(self.chunk_idx)
            self.chunk_idx = deque(self.chunk_idx)
            
        *x, y = self.chunk.loc[self.chunk_idx.popleft()].values
        return np.array(x), y.astype(float)
{% endraw %}
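
Here is a hedged usage sketch (the file path and row count are hypothetical; the CSV is assumed to be pre-shuffled, with feature columns first and the label in the last column). Because shuffling happens inside the dataset, the DataLoader itself should not shuffle:

{% raw %}
ds = LazyDataset('train.csv', length=100_000, shuffle=True, chunksize=10_000)
dl = DataLoader(ds, batch_size=64, shuffle=False)  # sequential sampling; shuffling is per-chunk inside the dataset
x, y = next(iter(dl))
{% endraw %}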

File Handling

{% raw %}
{% endraw %} {% raw %}

class BotoUploader[source]

BotoUploader(bucket, verbose=True)

Uploads files to S3. Built as a public alternative to Accio. Note that
the interfaces are not identical, so be careful to know which one you're
using.
{% endraw %} {% raw %}
up = BotoUploader('gg-datascience')
{% endraw %} {% raw %}
ft = up._convert_local_path('data/v1/history.csv')
tt = up._convert_local_path('data/v1/history.csv', 'hmamin')
tf = up._convert_local_path('data/v1/history.csv', 'hmamin', retain_tree=False)
ff = up._convert_local_path('data/v1/history.csv', retain_tree=False)

print('No S3 prefix, Yes retain file tree:\n' + ft)
print('\nYes S3 prefix, Yes retain file tree:\n' + tt)
print('\nYes S3 prefix, No retain file tree:\n' + tf)
print('\nNo S3 prefix, No retain file tree:\n' + ff)
No S3 prefix, Yes retain file tree:
data/v1/history.csv

Yes S3 prefix, Yes retain file tree:
hmamin/data/v1/history.csv

Yes S3 prefix, No retain file tree:
hmamin/history.csv

No S3 prefix, No retain file tree:
history.csv
{% endraw %}
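
For reference, the path logic demonstrated above can be approximated with a few lines of standard library code (a hypothetical re-implementation for illustration, not the library's actual method):

{% raw %}
import os

def convert_local_path(path, prefix=None, retain_tree=True):
    """Illustrative sketch: drop the directory tree if requested, then
    prepend an optional S3 prefix."""
    if not retain_tree:
        path = os.path.basename(path)
    return f'{prefix}/{path}' if prefix else path

convert_local_path('data/v1/history.csv', 'hmamin', retain_tree=False)
# 'hmamin/history.csv'
{% endraw %}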