In my last blog post, I explained the theory behind everything we'll be doing today. We're going to code a collaborative filtering model in two ways: first with probabilistic matrix factorization, then with deep learning. If you'd like to learn how deep learning works, check out my other blog post.

First, we'll download a subset of the MovieLens dataset containing 100,000 ratings (the full dataset has 25 million). The main reason is that I'm using the GPUs on Colab, and training on the full dataset would take too long.

from fastai.collab import *
from fastai.tabular.all import *
path = untar_data(URLs.ML_100k)
Path.BASE_PATH = path
path.ls()

(#23) [Path('ua.test'),Path('u.item'),Path('u2.base'),Path('u4.test'),Path('u.user'),Path('u.genre'),Path('u.occupation'),Path('u1.test'),Path('u5.base'),Path('u2.test')...]

And we can read the README file using the cat command:

!cat {path}/README

SUMMARY & USAGE LICENSE
=============================================

MovieLens data sets were collected by the GroupLens Research Project
at the University of Minnesota.
 
This data set consists of:
	* 100,000 ratings (1-5) from 943 users on 1682 movies. 
	* Each user has rated at least 20 movies. 
        * Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site
(movielens.umn.edu) during the seven-month period from September 19th, 
1997 through April 22nd, 1998. This data has been cleaned up - users
who had less than 20 ratings or did not have complete demographic
information were removed from this data set. Detailed descriptions of
the data file can be found at the end of this file.

Neither the University of Minnesota nor any of the researchers
involved can guarantee the correctness of the data, its suitability
for any particular purpose, or the validity of results based on the
use of the data set.  The data set may be used for any research
purposes under the following conditions:

     * The user may not state or imply any endorsement from the
       University of Minnesota or the GroupLens Research Group.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set
       (see below for citation information).

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from a faculty member of the GroupLens Research Project at the
       University of Minnesota.

If you have any further questions or comments, please contact GroupLens
<grouplens-info@cs.umn.edu>. 

CITATION
==============================================

To acknowledge use of the dataset in publications, please cite the 
following paper:

F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets:
History and Context. ACM Transactions on Interactive Intelligent
Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages.
DOI=http://dx.doi.org/10.1145/2827872


ACKNOWLEDGEMENTS
==============================================

Thanks to Al Borchers for cleaning up this data and writing the
accompanying scripts.

PUBLISHED WORK THAT HAS USED THIS DATASET
==============================================

Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An Algorithmic
Framework for Performing Collaborative Filtering. Proceedings of the
1999 Conference on Research and Development in Information
Retrieval. Aug. 1999.

FURTHER INFORMATION ABOUT THE GROUPLENS RESEARCH PROJECT
==============================================

The GroupLens Research Project is a research group in the Department
of Computer Science and Engineering at the University of Minnesota.
Members of the GroupLens Research Project are involved in many
research projects related to the fields of information filtering,
collaborative filtering, and recommender systems. The project is lead
by professors John Riedl and Joseph Konstan. The project began to
explore automated collaborative filtering in 1992, but is most well
known for its world wide trial of an automated collaborative filtering
system for Usenet news in 1996.  The technology developed in the
Usenet trial formed the base for the formation of Net Perceptions,
Inc., which was founded by members of GroupLens Research. Since then
the project has expanded its scope to research overall information
filtering solutions, integrating in content-based methods as well as
improving current collaborative filtering technology.

Further information on the GroupLens Research project, including
research publications, can be found at the following web site:
        
        http://www.grouplens.org/

GroupLens Research currently operates a movie recommender based on
collaborative filtering:

        http://www.movielens.org/

DETAILED DESCRIPTIONS OF DATA FILES
==============================================

Here are brief descriptions of the data.

ml-data.tar.gz   -- Compressed tar file.  To rebuild the u data files do this:
                gunzip ml-data.tar.gz
                tar xvf ml-data.tar
                mku.sh

u.data     -- The full u data set, 100000 ratings by 943 users on 1682 items.
              Each user has rated at least 20 movies.  Users and items are
              numbered consecutively from 1.  The data is randomly
              ordered. This is a tab separated list of 
	         user id | item id | rating | timestamp. 
              The time stamps are unix seconds since 1/1/1970 UTC   

u.info     -- The number of users, items, and ratings in the u data set.

u.item     -- Information about the items (movies); this is a tab separated
              list of
              movie id | movie title | release date | video release date |
              IMDb URL | unknown | Action | Adventure | Animation |
              Children's | Comedy | Crime | Documentary | Drama | Fantasy |
              Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
              Thriller | War | Western |
              The last 19 fields are the genres, a 1 indicates the movie
              is of that genre, a 0 indicates it is not; movies can be in
              several genres at once.
              The movie ids are the ones used in the u.data data set.

u.genre    -- A list of the genres.

u.user     -- Demographic information about the users; this is a tab
              separated list of
              user id | age | gender | occupation | zip code
              The user ids are the ones used in the u.data data set.

u.occupation -- A list of the occupations.

u1.base    -- The data sets u1.base and u1.test through u5.base and u5.test
u1.test       are 80%/20% splits of the u data into training and test data.
u2.base       Each of u1, ..., u5 have disjoint test sets; this if for
u2.test       5 fold cross validation (where you repeat your experiment
u3.base       with each training and test set and average the results).
u3.test       These data sets can be generated from u.data by mku.sh.
u4.base
u4.test
u5.base
u5.test

ua.base    -- The data sets ua.base, ua.test, ub.base, and ub.test
ua.test       split the u data into a training set and a test set with
ub.base       exactly 10 ratings per user in the test set.  The sets
ub.test       ua.test and ub.test are disjoint.  These data sets can
              be generated from u.data by mku.sh.

allbut.pl  -- The script that generates training and test sets where
              all but n of a users ratings are in the training data.

mku.sh     -- A shell script to generate all the u data sets from u.data.

This tells us that u.data contains the full data set: 100,000 ratings by 943 users on 1,682 items, where each user has rated at least 20 movies. The data is tab-separated with columns user id, item/movie id, rating, and timestamp. So, let's read the file:

ratings = pd.read_csv(
    path/'u.data',
    delimiter = '\t',
    header = None,
    names = ['user', 'movie', 'rating', 'timestamp'])
ratings.head()
user movie rating timestamp
0 196 242 3 881250949
1 186 302 3 891717742
2 22 377 1 878887116
3 244 51 2 880606923
4 166 346 1 886397596

We'd like the actual movie title instead of just the movie ID, so we can also read the u.item file (although the README says it's tab-separated, it's actually pipe-separated):

movies = pd.read_csv(
    path/'u.item',
    delimiter = '|',
    header = None,
    usecols = [0, 1],
    names = ['movie', 'title'],
    encoding = 'latin-1'
    )
movies.head()
movie title
0 1 Toy Story (1995)
1 2 GoldenEye (1995)
2 3 Four Rooms (1995)
3 4 Get Shorty (1995)
4 5 Copycat (1995)

Then, we can merge the two tables together:

ratings = ratings.merge(movies)
ratings.head()
user movie rating timestamp title
0 196 242 3 881250949 Kolya (1996)
1 63 242 3 875747190 Kolya (1996)
2 226 242 5 883888671 Kolya (1996)
3 154 242 3 879138235 Kolya (1996)
4 306 242 5 876503793 Kolya (1996)

Then, we build our DataLoaders object, which holds our training and validation DataLoaders (each of which produces mini-batches from its Dataset).

dls = CollabDataLoaders.from_df(ratings, item_name = 'title', bs = 64)
dls.show_batch()
user title rating
0 542 My Left Foot (1989) 4
1 422 Event Horizon (1997) 3
2 311 African Queen, The (1951) 4
3 595 Face/Off (1997) 4
4 617 Evil Dead II (1987) 1
5 158 Jurassic Park (1993) 5
6 836 Chasing Amy (1997) 3
7 474 Emma (1996) 3
8 466 Jackie Chan's First Strike (1996) 3
9 554 Scream (1996) 3

And then we create our model, which will contain our embeddings. For a deep learning model, we can't just index into a matrix, since we have to be able to calculate the derivative of every operation we do. Instead, we use one-hot encoding: a vector with a 1 in the position we want to index into. For example, if we have the array [0, 1, 2, 3] and we want the element at index 2 (which is 2), we matrix-multiply the one-hot row vector [0, 0, 1, 0] by the array's transpose: $$\begin{bmatrix}0&0&1&0\end{bmatrix}\begin{bmatrix}0&1&2&3\end{bmatrix}^T=\begin{bmatrix}0&0&1&0\end{bmatrix}\begin{bmatrix}0\\1\\2\\3\end{bmatrix}=\begin{bmatrix}2\end{bmatrix}$$
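To make that concrete, here's a minimal sketch in plain Python of how the one-hot dot product picks out an element (the variable names are my own):

```python
# The array we want to index into, and a one-hot vector
# with a 1 at position 2.
vec = [0, 1, 2, 3]
one_hot = [0, 0, 1, 0]

# The dot product zeroes out every element except the one at the
# "hot" position, so this equals vec[2].
result = sum(v * o for v, o in zip(vec, one_hot))
```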

But storing and multiplying one-hot-encoded vectors is time- and memory-consuming, so most deep learning libraries (like PyTorch) provide a special layer called an embedding. An embedding mimics multiplication by a one-hot-encoded matrix: it simply indexes into a matrix with an integer, but its derivative is calculated such that it's identical to what it would have been if a matrix multiplication with a one-hot-encoded vector had been done.

Optimizers need to be able to get all the parameters from a model, so all an embedding does is randomly initialize a matrix and wrap it in the nn.Parameter class, which tells PyTorch that it's a trainable parameter.
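As a rough sketch of that idea (my own toy class, not fastai's or PyTorch's actual implementation), an embedding layer boils down to something like this:

```python
import torch
from torch import nn

class MiniEmbedding(nn.Module):
    def __init__(self, n_rows, n_cols):
        super().__init__()
        # A randomly initialized matrix wrapped in nn.Parameter,
        # so the optimizer can find and train it.
        self.weight = nn.Parameter(torch.randn(n_rows, n_cols))

    def forward(self, idx):
        # Plain integer indexing; autograd computes the same gradient
        # a one-hot matrix multiplication would produce.
        return self.weight[idx]

emb = MiniEmbedding(5, 3)
idx = torch.tensor([2])
one_hot = torch.zeros(1, 5)
one_hot[0, 2] = 1.0
# Indexing and the one-hot matmul produce identical rows.
same = torch.allclose(emb(idx), one_hot @ emb.weight)
```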

When we refer to an embedding, we mean the embedding matrix: the thing that's multiplied by the one-hot-encoded matrix, or equivalently the thing that's indexed into. In our case, the embedding matrices hold our latent factors (and biases).

When creating a neural network model with PyTorch, we inherit from the Module class, which contains the essentials; we just have to define __init__ (pronounced "dunder init") to initialize our model, and forward, which is essentially the "predict" step: it accepts a mini-batch and returns the predictions.

n_users = len(dls.classes['user'])
n_items = len(dls.classes['title'])
n_factors = 50
class DotProduct(Module):
    def __init__(self, n_users, n_items, n_factors, y_range = (0, 5.5)):
        # User latent factors and biases
        self.user_factors = Embedding(n_users, n_factors)
        self.user_bias    = Embedding(n_users, 1)
        # Item latent factors and biases
        self.item_factors = Embedding(n_items, n_factors)
        self.item_bias    = Embedding(n_items, 1)
        # Range for our predictions
        self.y_range      = y_range

    def forward(self, x):
        # Get first column (the users) from input
        users = self.user_factors(x[:,0])
        # Get second column (the titles) from input
        items = self.item_factors(x[:,1])
        # Calculate the dot product
        dot_prod = (users * items).sum(dim = 1, keepdim = True)
        # Add the user biases and the item biases to the dot product
        dot_prod += self.user_bias(x[:,0]) + self.item_bias(x[:,1])
        # Squash the prediction into the chosen range:
        # sigmoid returns a value between 0 and 1, so multiplying
        # by (hi - lo) and adding lo gives a value between lo and
        # hi, which is what sigmoid_range does
        return sigmoid_range(dot_prod, *self.y_range)
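sigmoid_range itself is tiny; here's a sketch of roughly what it does (my own re-implementation, under a hypothetical name):

```python
import torch

def sigmoid_range_sketch(x, lo, hi):
    # Squash x into (0, 1) with sigmoid, then rescale to (lo, hi).
    return torch.sigmoid(x) * (hi - lo) + lo

# sigmoid(0) = 0.5, so an input of 0 maps to the middle of the range.
mid = sigmoid_range_sketch(torch.tensor(0.0), 0, 5.5)  # -> tensor(2.7500)
```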

Now that we have our model, we can create an object with it and pass it into a Learner and train it.

model = DotProduct(n_users, n_items, n_factors)
learn = Learner(dls, model, loss_func = MSELossFlat())
# We also use weight decay (L2 regularization)
# to reduce overfitting
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
epoch train_loss valid_loss time
0 0.941400 0.941900 00:09
1 0.847874 0.877467 00:08
2 0.719121 0.835374 00:07
3 0.594287 0.824023 00:07
4 0.483335 0.824634 00:07
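Weight decay adds the sum of squared parameters (times wd) to the loss, which pushes every parameter toward zero and discourages overfitting. A toy sketch of the effect on gradients (made-up parameter values):

```python
import torch

# A toy parameter; pretend the data loss is zero so we can see
# the effect of the weight decay term alone.
p = torch.tensor([1.0, -2.0], requires_grad = True)
wd = 0.1
loss = wd * (p ** 2).sum()
loss.backward()
# The gradient of wd * p^2 is 2 * wd * p, so each optimizer step
# shrinks the parameters slightly toward zero.
grad = p.grad  # tensor([ 0.2000, -0.4000])
```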

And we don't actually need to define our own DotProduct class: we can use fast.ai's collab_learner instead.

learn = collab_learner(dls, n_factors = 50, y_range = (0, 5.5))
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
epoch train_loss valid_loss time
0 0.940803 0.954099 00:08
1 0.846296 0.874175 00:07
2 0.741423 0.838990 00:07
3 0.590897 0.822672 00:07
4 0.492853 0.823269 00:07

And we see the results are similar since the model used by collab_learner is essentially equivalent:

# 50 latent factors for users and items
# bias for users and items
learn.model
EmbeddingDotBias(
  (u_weight): Embedding(944, 50)
  (i_weight): Embedding(1665, 50)
  (u_bias): Embedding(944, 1)
  (i_bias): Embedding(1665, 1)
)

To turn our architecture into a deep learning model, we need a neural network. Instead of taking the dot product of the user and item latent factors, we concatenate them into one matrix and pass it through linear layers. Because of this, we no longer need the same number of latent factors for users as for items. To pick the sizes, we can use fast.ai's get_emb_sz function on our DataLoaders, which returns recommended embedding sizes:

embs = get_emb_sz(dls)
embs
[(944, 74), (1665, 102)]
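If I recall fastai's source correctly, the heuristic behind these sizes (emb_sz_rule) is roughly the following; treat this sketch as an approximation rather than the library's exact code:

```python
# Roughly fastai's rule of thumb for embedding sizes:
# grow with the number of categories, capped at 600.
def emb_sz_rule_sketch(n_cat):
    return min(600, round(1.6 * n_cat ** 0.56))

sizes = (emb_sz_rule_sketch(944), emb_sz_rule_sketch(1665))  # -> (74, 102)
```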

And we can rewrite our DotProduct class as a neural network like so:

class SimpleNet(Module):
    def __init__(self, user_sz, item_sz, y_range = (0, 5.5), n_acts = 100):
        # nn.Linear implements bias implicitly, so we
        # don't need to define our own bias.
        self.user_factors = Embedding(*user_sz)
        self.item_factors = Embedding(*item_sz)
        self.layers       = nn.Sequential(
            nn.Linear(user_sz[1] + item_sz[1], n_acts),
            nn.ReLU(),
            nn.Linear(n_acts, 1))
        self.y_range      = y_range

    def forward(self, x):
        embs = self.user_factors(x[:,0]), self.item_factors(x[:,1])
        x    = self.layers(torch.cat(embs, dim = 1))
        return sigmoid_range(x, *self.y_range)

Then, we can put it in a Learner and train our deep learning model:

model = SimpleNet(*embs)
learn = Learner(dls, model, loss_func = MSELossFlat())
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
epoch train_loss valid_loss time
0 0.942289 0.957183 00:07
1 0.918996 0.915120 00:07
2 0.854367 0.902296 00:07
3 0.820374 0.877131 00:07
4 0.827481 0.877810 00:07

And just as we didn't need to define our own DotProduct class and could use collab_learner instead, we can do the same with SimpleNet.

# We just have to enable the use_nn parameter and
# give it layers
learn = collab_learner(dls, use_nn = True, y_range = (0, 5.5), layers = [100, 50])
learn.fit_one_cycle(5, 5e-3, wd = 0.1)
epoch train_loss valid_loss time
0 1.018050 0.979968 00:13
1 0.906580 0.922872 00:08
2 0.909671 0.890887 00:08
3 0.814160 0.870163 00:08
4 0.802491 0.869587 00:09

Interpreting the results

Now that we've trained a model, there are several ways to interpret the results.

First, we can look at the biases:

# First, take the biases and put them into
# a one-dimensional tensor that we can sort
item_bias = learn.model.item_bias.weight.squeeze()
# argsort returns a list of indexes that would
# sort the tensor
idxs_bot  = item_bias.argsort()[:5]
idxs_top  = item_bias.argsort(descending = True)[:5]
# display the titles of the 5 "worst" movies
# and the 5 "best" movies, respectively
[dls.classes['title'][i] for i in idxs_bot],[dls.classes['title'][i] for i in idxs_top]
(['Children of the Corn: The Gathering (1996)',
  'Lawnmower Man 2: Beyond Cyberspace (1996)',
  'Crow: City of Angels, The (1996)',
  'Beautician and the Beast, The (1997)',
  'Robocop 3 (1993)'],
 ['Titanic (1997)',
  'L.A. Confidential (1997)',
  'Shawshank Redemption, The (1994)',
  "Schindler's List (1993)",
  'Silence of the Lambs, The (1991)'])
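If argsort is unfamiliar, here's a quick illustration on a toy tensor of my own:

```python
import torch

t = torch.tensor([3.0, 1.0, 2.0])
# argsort returns the indices that would sort the tensor ascending:
# t[1] = 1.0 is smallest, then t[2] = 2.0, then t[0] = 3.0.
idxs = t.argsort()  # -> tensor([1, 2, 0])
```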

Then, we can find the movies whose latent factors are closest to a given movie's, using cosine similarity as the distance measure:

item_factors = learn.model.item_factors.weight
idx = dls.classes['title'].o2i['Toy Story (1995)']
distances = nn.CosineSimilarity()(item_factors, item_factors[idx][None])
idx = distances.argsort(descending = True)[1:5]
dls.classes['title'][idx]
(#4) ['That Thing You Do! (1996)','Abyss, The (1989)','Wizard of Oz, The (1939)','Aladdin (1992)']

So, the four movies in the data set that are most similar to Toy Story are the ones above.
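Cosine similarity is just the dot product of two L2-normalized vectors; a quick sketch with made-up vectors:

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 4.0, 6.0])
# Dot product divided by the product of the norms.
cos = (a @ b) / (a.norm() * b.norm())
# b is a scaled copy of a (same direction), so the similarity is 1.
```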

In the next few blog posts, I'll talk more about deep learning and machine learning with tabular data.