So you want to code collaborative filtering
Wait... doesn't this go against the top-down approach of fast.ai?
Hopefully you've read my last blog post, which explains everything I'm going to be doing in today's blog. We're going to code a collaborative filtering model in two ways: first with probabilistic matrix factorization, and then with deep learning. If you'd like to learn how deep learning works, check out my other blog post.
First, we'll download a subset of the MovieLens dataset containing 100,000 ratings (the full dataset has 25 million). The main reason is that I'm using the GPUs on Colab, and training a model on the full dataset would take too long.
from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)
Path.BASE_PATH = path
path.ls()
And we can read the README file using the cat command:
!cat {path}/README
Which tells us that u.data contains the full data set: 100,000 ratings by 943 users on 1,682 items, where each user rated at least 20 movies. The data is tab separated, with column names user id, item/movie id, rating, and timestamp. So, let's read it with pandas:
ratings = pd.read_csv(
path/'u.data',
delimiter = '\t',
header = None,
names = ['user', 'movie', 'rating', 'timestamp'])
ratings.head()
We'd like to know the actual movie name instead of the movie ID, so we can also read the u.item file (although it says it's tab separated, it's actually pipe separated):
movies = pd.read_csv(
path/'u.item',
delimiter = '|',
header = None,
usecols = [0, 1],
names = ['movie', 'title'],
encoding = 'latin-1'
)
movies.head()
Then, we can merge the two tables together:
ratings = ratings.merge(movies)
ratings.head()
Then, we build our DataLoaders object, which holds our training and validation DataLoaders (which produce our mini-batches from the Datasets).
dls = CollabDataLoaders.from_df(ratings, item_name = 'title', bs = 64)
dls.show_batch()
And then we create our model, which will contain our embeddings. We can't just index into a matrix for a deep learning model, since we have to calculate the derivative of every operation we do. Instead, we use one-hot encoding: a vector with a 1 in the place we want to index into. For example, if we have an array [0, 1, 2, 3] and we want the element at index 2 (which is 2), we would multiply the one-hot vector [0, 0, 1, 0] by the array's transpose:
$$\begin{bmatrix}0&0&1&0\end{bmatrix}\begin{bmatrix}0 & 1 & 2 & 3\end{bmatrix}^T=\begin{bmatrix}0&0&1&0\end{bmatrix}\begin{bmatrix}0\\1\\2\\3\end{bmatrix}=\begin{bmatrix}2\end{bmatrix}$$
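To make that concrete, here's a tiny PyTorch check (a sketch for illustration, not part of the model we'll build) showing that multiplying by the one-hot vector gives the same result as plain indexing:
import torch

a = torch.tensor([0., 1., 2., 3.])
one_hot = torch.tensor([0., 0., 1., 0.])

# The dot product with the one-hot vector picks out index 2...
print(one_hot @ a)  # tensor(2.)
# ...which is the same as just indexing
print(a[2])         # tensor(2.)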
But storing and using one-hot-encoded vectors is pretty time- and memory-consuming, so most deep learning libraries (like PyTorch) provide a special layer called an embedding. An embedding mimics the process of multiplying by a one-hot-encoded matrix, but it just indexes into a matrix using an integer, while having its derivative calculated such that it's identical to what it would have been if a matrix multiplication with a one-hot-encoded vector had been done.
Optimizers need to be able to get all the parameters from a model, so all an embedding does is randomly initialize a matrix and wrap it in the nn.Parameter class, which tells PyTorch that it's a trainable parameter.
When we refer to an embedding, we mean the embedding matrix: the thing that's multiplied by the one-hot-encoded matrix, or equivalently the thing that's being indexed into. So, the embedding matrices in this case are our latent factors (and biases).
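A quick way to convince yourself of this, using plain PyTorch (a sketch for illustration; as far as I know, fastai's Embedding only differs in how it initializes the weights):
import torch
import torch.nn.functional as F

# A small embedding: 4 "items", 3 latent factors each
emb = torch.nn.Embedding(4, 3)

idx = torch.tensor([2])
# Indexing into the embedding matrix...
via_indexing = emb(idx)
# ...matches multiplying a one-hot vector by the weight matrix
via_one_hot = F.one_hot(idx, num_classes = 4).float() @ emb.weight

print(torch.allclose(via_indexing, via_one_hot))  # True
# The weight matrix is wrapped in nn.Parameter, so the optimizer can find it
print(type(emb.weight))  # <class 'torch.nn.parameter.Parameter'>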
When creating a neural network model with PyTorch, we have to inherit from its Module class, which contains the essentials; we just have to define __init__ (called "dunder init") to initialize our model, and forward, which is essentially the "predict" step of the model. forward accepts a mini-batch of inputs and returns predictions.
n_users = len(dls.classes['user'])
n_items = len(dls.classes['title'])
n_factors = 50
class DotProduct(Module):
    def __init__(self, n_users, n_items, n_factors, y_range = (0, 5.5)):
        # User latent factors and biases
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias = Embedding(n_users, 1)
        # Item latent factors and biases
        self.item_factors = Embedding(n_items, n_factors)
        self.item_bias = Embedding(n_items, 1)
        # Range for our predictions
        self.y_range = y_range

    def forward(self, x):
        # Get first column (the users) from the input
        users = self.user_factors(x[:,0])
        # Get second column (the titles) from the input
        items = self.item_factors(x[:,1])
        # Calculate the dot product
        dot_prod = (users * items).sum(dim = 1, keepdim = True)
        # Add the user biases and the item biases to the dot product
        dot_prod += self.user_bias(x[:,0]) + self.item_bias(x[:,1])
        # Return the prediction in the chosen range.
        # Sigmoid returns a value between 0 and 1; we can multiply it
        # by (hi - lo) and add lo to get a value between lo and hi,
        # which is what sigmoid_range does.
        return sigmoid_range(dot_prod, *self.y_range)
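If you're wondering what sigmoid_range does under the hood, it's roughly the following (a sketch based on the description above, not fastai's actual source). Using 5.5 instead of 5 as the upper bound helps because a sigmoid never quite reaches its maximum, so this leaves room for the model to actually predict a full 5-star rating.
import torch

# Roughly what sigmoid_range does (a sketch, not the library code):
# squash x into (0, 1) with a sigmoid, then stretch/shift it into (lo, hi)
def sigmoid_range_sketch(x, lo, hi):
    return torch.sigmoid(x) * (hi - lo) + lo

print(sigmoid_range_sketch(torch.tensor(0.), 0, 5.5))  # tensor(2.7500)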
Now that we have our model, we can create an object with it, pass it into a Learner, and train it.
model = DotProduct(n_users, n_items, n_factors)
learn = Learner(dls, model, loss_func = MSELossFlat())
# We also use weight decay (L2 regularization) to reduce overfitting
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
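A note on that wd argument: weight decay (L2 regularization) adds a penalty proportional to the sum of the squared parameters to the loss, which discourages large weights and helps prevent overfitting. Conceptually it looks like this (a toy sketch, not fastai's optimizer code):
import torch

p = torch.randn(3, requires_grad = True)  # a stand-in parameter
wd = 0.1

loss = (p * 2).sum()                      # a stand-in for the model's loss
# Weight decay adds wd * sum(p ** 2) to the loss...
loss_wd = loss + wd * (p ** 2).sum()
loss_wd.backward()

# ...which is the same as adding 2 * wd * p to each parameter's gradient
print(torch.allclose(p.grad, torch.full_like(p, 2.) + 2 * wd * p))  # True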
And, we don't need to define our own DotProduct class. We can instead use fast.ai's collab_learner.
learn = collab_learner(dls, n_factors = 50, y_range = (0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
And we see the results are similar, since the model used by collab_learner is essentially equivalent:
# 50 latent factors for users and items
# bias for users and items
learn.model
To turn our architecture into a deep learning model, we need a neural network. With a neural network, we start with a large matrix that we pass through layers. Instead of taking the dot product, we concatenate the latent factors of the users and the items, which also means we don't need the same number of latent factors for users as for items. To get the embedding sizes, we can use fast.ai's get_emb_sz function on our DataLoaders, which gives us recommended sizes:
embs = get_emb_sz(dls)
embs
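get_emb_sz picks these sizes with a rule of thumb based on how many categories each variable has. If I remember fastai's heuristic correctly, it's roughly the following (treat the exact constants as an assumption):
# Roughly fastai's rule of thumb for choosing an embedding size
# (constants from memory; treat them as approximate)
def emb_sz_rule_sketch(n_cat):
    return min(600, round(1.6 * n_cat ** 0.56))

print(emb_sz_rule_sketch(944))  # e.g. for ~944 user classes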
And, we can rewrite our DotProduct class like so:
class SimpleNet(Module):
    def __init__(self, user_sz, item_sz, y_range = (0, 5.5), n_acts = 100):
        # nn.Linear implements bias implicitly, so we
        # don't need to define our own bias.
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_acts),
            nn.ReLU(),
            nn.Linear(n_acts, 1))
        self.y_range = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x = self.layers(torch.cat(embs, dim = 1))
        return sigmoid_range(x, *self.y_range)
Then, we can put it in a Learner and train our deep learning model:
model = SimpleNet(*embs)
learn = Learner(dls, model, loss_func = MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
And, just like we didn't need to define our own DotProduct class and could use collab_learner instead, we can do the same with SimpleNet.
# We just have to enable the use_nn parameter and
# give it layers
learn = collab_learner(dls, use_nn = True, y_range = (0, 5.5), layers = [100, 50])
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
Now that you've trained a model, there are several ways to interpret your results.
First, we can look at the biases:
# First, take the biases and put them into
# a one-dimensional tensor that we can sort
item_bias = learn.model.item_bias.weight.squeeze()
# argsort returns a list of indexes that would
# sort the tensor
idxs_bot = item_bias.argsort()[:5]
idxs_top = item_bias.argsort(descending = True)[:5]
# display the titles of the 5 "worst" movies
# and the 5 "best" movies, respectively
[dls.classes['title'][i] for i in idxs_bot],[dls.classes['title'][i] for i in idxs_top]
Then, we can find the movies whose latent factors are closest to a given movie's, using cosine similarity:
item_factors = learn.model.item_factors.weight
idx = dls.classes['title'].o2i['Toy Story (1995)']
distances = nn.CosineSimilarity()(item_factors, item_factors[idx][None])
idx = distances.argsort(descending = True)[1:5]
dls.classes['title'][idx]
So, the four movies in the data set that are most similar to Toy Story are the ones above.
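For reference, the cosine similarity of two factor vectors is just their dot product divided by the product of their lengths, so a value of 1 means the vectors point in exactly the same direction. Here's a minimal sketch of what nn.CosineSimilarity computes for a single pair of vectors:
import torch

def cosine_similarity_sketch(a, b, eps = 1e-8):
    # Dot product divided by the product of the vectors' lengths
    return (a * b).sum() / (a.norm() * b.norm()).clamp(min = eps)

a = torch.tensor([1., 2., 3.])
b = torch.tensor([2., 4., 6.])
print(cosine_similarity_sketch(a, b))  # tensor(1.) -- same direction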
In the next few blog posts, I'll be talking more about deep learning and machine learning with tabular data.