So you want to learn collaborative filtering
Who doesn't love recommendation systems?
You want to start a movie streaming service. You're learning machine learning, and you want a way to recommend movies to people so they watch more, spend more time on your app, and bring in more ad revenue. How would you solve this problem?
Let's look at Bingus. He's a McMaster alumnus turned McDonald's employee. He works an 8 AM to 6 PM shift flipping burgers for minimum wage and gets home to watch some movies. He really likes this one movie called "Cars 3". He's watched it dozens of times now and he won't stop.
Then comes Alice. Unlike Bingus, she graduated from Waterloo and landed a CS job earning six figures. She even graduated with the highest GPA of her year and landed that job through a return offer from her summer co-op. So what? Alice and Bingus both like "Cars 3".
Since they both like "Cars 3", we could recommend Alice movies that Bingus also likes and recommend Bingus movies that Alice also likes.
Collaborative filtering: look at items the current user has used or liked, find other users who used or liked similar items, then recommend the items those users liked to the current user.
Collaborative filtering works through latent factors, which are basically the "tags" you would give an item. For example, you could give a movie tags like "science fiction," "action," "old," "horror," and "romance." They're "latent" because they're never observed directly: the factors only take on meaning through how users interact with items.
What's wild is we never tell our model the latent factors. We just say how many we want and the model learns them on its own. So, we don't need to know anything about the items (in a descriptive way); we just need the data on the users and the items.
Our model will work using stochastic gradient descent. So, we need three things:
- random parameters;
- a way to calculate our predictions; and
- a loss function.
We first randomly initialize some parameters for our latent factors. Let's say we want our model to learn 5 latent factors. Then, each user and item would have 5 parameters (or, a 5-dimensional vector). We can use the dot product between a user's vector and an item's vector as our prediction of how well that item matches that user. The higher the value, the more likely the user will like that product.
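Here's a minimal sketch of that in PyTorch (the counts `n_users` and `n_movies` and the indices are made up for illustration):

```python
import torch

n_users, n_movies, n_factors = 100, 50, 5

# Randomly initialized latent factors: one 5-dimensional vector
# per user and one per movie.
user_factors = torch.randn(n_users, n_factors)
movie_factors = torch.randn(n_movies, n_factors)

# Prediction for user 3 and movie 7: the dot product of their vectors.
prediction = (user_factors[3] * movie_factors[7]).sum()
```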
Finally, we need to pick a loss function. Since we're dealing with regression instead of classification (like digit classification), we can use L1 norm or L2 norm.
L1 norm (mean absolute difference): take the absolute difference between the two values and take the mean.

$$L_1=\frac{1}{n}\sum^n_{i=1}|a_i-b_i|, \quad a\in A, \, b\in B$$

L2 norm (root mean squared error): take the square of the differences between the two values and take the mean. Then, square root the result.

$$L_2=\sqrt{\frac{1}{n}\sum^n_{i=1}(a_i-b_i)^2}, \quad a\in A, \, b\in B$$
What's the difference? The L2 norm puts a larger emphasis on big errors and a smaller emphasis on tiny ones, because of squaring: large * large = larger, while small * small = smaller.
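A quick sketch of both losses, assuming `preds` and `targets` are tensors of predicted and actual ratings:

```python
import torch

preds = torch.tensor([3.5, 4.2, 1.0])    # hypothetical predictions
targets = torch.tensor([4.0, 4.0, 2.5])  # hypothetical actual ratings

# L1 norm: mean of the absolute differences.
l1 = (preds - targets).abs().mean()

# L2 norm: mean of the squared differences, then square root.
l2 = ((preds - targets) ** 2).mean().sqrt()
```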
Now, we have all we need for stochastic gradient descent: random parameters, a way to calculate our predictions, and our loss function.
At each step, the SGD optimizer calculates the match between each item and user (the random parameters) using the dot product (our procedure for predictions) and compares it to the actual rating that each user gave to each item (the loss function). Then, it calculates the derivative of this value (the gradient) and steps the weights: multiply the derivative by the learning rate and subtract that value from the weights (the descent).
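Here's what one such step might look like by hand, under the same made-up setup as before (in practice an optimizer does this for you):

```python
import torch

n_users, n_movies, n_factors, lr = 100, 50, 5, 0.03

user_factors = torch.randn(n_users, n_factors, requires_grad=True)
movie_factors = torch.randn(n_movies, n_factors, requires_grad=True)

# Hypothetical training data: (user index, movie index, rating).
users = torch.tensor([3, 7, 3])
movies = torch.tensor([7, 2, 9])
ratings = torch.tensor([4.0, 2.0, 5.0])

# One SGD step:
preds = (user_factors[users] * movie_factors[movies]).sum(dim=1)  # dot products
loss = ((preds - ratings) ** 2).mean().sqrt()                     # L2 loss
loss.backward()                                                   # gradients

with torch.no_grad():
    # Descent: weights -= learning rate * gradient.
    user_factors -= lr * user_factors.grad
    movie_factors -= lr * movie_factors.grad
    user_factors.grad.zero_()
    movie_factors.grad.zero_()
```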
After each epoch, the loss gets better (lower) and the recommendations will also get better.
There's usually a range for ratings on an item, like "out of 5 stars". So, you should constrain your predictions to a similar range as the rating system you have for an item. For example, if a movie service has a rating system from 0 to 5, your predictions should range from 0 to 5.5. It's been discovered empirically that setting the upper bound a little above the true maximum returns better results.
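One common way to do this (fastai calls it `sigmoid_range`) is to push the raw score through a sigmoid and rescale it; a hand-rolled version looks like this:

```python
import torch

def sigmoid_range(x, low, high):
    # Sigmoid squashes x into (0, 1); rescale into (low, high).
    return torch.sigmoid(x) * (high - low) + low

raw_scores = torch.tensor([-3.0, 0.0, 4.0])
preds = sigmoid_range(raw_scores, 0, 5.5)  # every prediction lands in (0, 5.5)
```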
And, remember how parameters are the weights and biases? We should also attach a bias to each user and each item. Some users may be more positive or negative than others, and some items may be superior to others. Biases can also reflect current trends. However, adding biases on their own usually leads to overfitting.
So, you also need to add L2 regularization.
L2 regularization (weight decay): add the sum of all the weights squared to your loss function.
Why? Because when you compute the gradients, the added sum encourages the weights to be as small as possible. This prevents overfitting: high parameters lead to sharper changes in the function the model learns, which lets it chase noise in the training data. The smaller parameters that weight decay encourages smooth that out.
However, you don't apply weight decay by literally adding the sum of all the weights squared to the loss function (it would be inefficient and lead to huge numbers). Instead, since the derivative of $x^2$ is $2x$, you add double the parameters to the gradient. And, weight decay is just a hyperparameter that you multiply $2x$ by, so what you actually do is `gradient += wd * 2 * parameters`, which is essentially `gradient += wd * parameters` (the 2 is incorporated into `wd`, like how you just have $+ C$ for integrals). The end result of adding biases with weight decay is that we make training the model a bit harder, but the model will generalize better in practice.
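As a sketch, with `wd` as the hyperparameter and `parameters` standing in for any weight tensor:

```python
import torch

wd = 0.1
parameters = torch.randn(5, requires_grad=True)

loss = (parameters ** 2).mean()  # stand-in loss, just for illustration
loss.backward()

with torch.no_grad():
    # Weight decay: instead of adding wd * (parameters ** 2).sum() to the
    # loss, add its derivative (2 * parameters, scaled by wd) to the gradient.
    parameters.grad += wd * 2 * parameters
```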
Now, you finished training your model. You have your biases and latent factors (weights) all set. How can you interpret your parameters before putting your model in action?
With biases, you can sort items (as sketched after this list) to see
- current trends; and
- which items are good (high bias) or bad (low bias).
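For instance, assuming `movie_bias` is the trained bias vector and `movie_titles` is a matching list of names (both hypothetical here):

```python
import torch

movie_bias = torch.tensor([0.8, -0.5, 0.1])          # hypothetical trained biases
movie_titles = ["Cars 3", "Cars 4", "Harry Potter"]

# Sort movies from highest bias (best-liked overall) to lowest.
idxs = movie_bias.argsort(descending=True)
print([movie_titles[i] for i in idxs])
```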
Interpreting the latent factors is a bit trickier in that you can't just sort them. But, there is a technique called principal component analysis (PCA), which extracts the most important directions in the latent factors so you can project them down to something you can actually plot.
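A sketch using `torch.pca_lowrank`, with `movie_factors` standing in for the trained item embeddings:

```python
import torch

movie_factors = torch.randn(50, 5)  # stand-in for trained item embeddings

# Take the 2 most important directions and project the factors onto them.
U, S, V = torch.pca_lowrank(movie_factors, q=2)  # centers the data internally
centered = movie_factors - movie_factors.mean(dim=0)
projected = centered @ V  # shape: (50, 2), ready to plot
```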
There's a simpler alternative if you just want to compare a few items: you can calculate the "distance" between two items. If two items were very similar, then their latent factors would also be similar. So, their "distance" would be low compared to the distance between a more different item. Ultimately, item similarity in a model is dictated by the similarity of users that like those items.
To calculate the distance, you use Pythagoras' formula:

$$d=\sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$$

except you would do this for however many dimensions there are. For example, the distance between two 50-dimensional embeddings (the parameters) would be

$$d=\sqrt{(x_{2,1}-x_{1,1})^2+(x_{2,2}-x_{1,2})^2+\dots+(x_{2,50}-x_{1,50})^2}$$

So, say you have one movie, "Cars 3", and two other movies: "Cars 4" and "Harry Potter". The distance between "Cars 3" and "Cars 4" might be 50, while it's 100 for "Cars 3" and "Harry Potter". Then, since the former distance is shorter, you could infer that "Cars 3" is more similar to "Cars 4" than it is to "Harry Potter".
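In code, with made-up 50-dimensional embeddings:

```python
import torch

def distance(a, b):
    # Euclidean distance: square the differences, sum, square-root.
    return ((a - b) ** 2).sum().sqrt()

# Stand-ins for rows of the trained movie_factors matrix.
cars3, cars4, harry_potter = torch.randn(3, 50)

print(distance(cars3, cars4), distance(cars3, harry_potter))
```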
We all know by now that overfitting is a big problem in the training process. But, there's an equally important problem in practice for collaborative filtering: the bootstrapping problem.
Bootstrap: to start something with little help.
Bootstrapping problem: what do you do when you have no users (no data) to train your model; and, if you do have previous users, what do you recommend for a new user? Similarly, what do you do when you add a new item?
Like overfitting, there isn't a solution that works for everything, but there are some commonly used ones (the first is sketched after this list):
- assign the mean/median of all the latent factors to the new user or item;
- pick a specific user or item to represent the average user; or
- survey the new user (or gather metadata on a new item) to construct a basic set of latent factors for them.
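The first option is a one-liner, assuming `user_factors` holds the existing users' embeddings:

```python
import torch

user_factors = torch.randn(100, 5)  # stand-in for trained user embeddings

# Start a brand-new user at the mean of everyone else's latent factors.
new_user = user_factors.mean(dim=0)
```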
However, solutions to the bootstrapping problem lead to another problem: positive feedback loops. A small number of otakus can set the recommendations for the entire user base. You might expect this feedback loop to be an outlier, but it's actually the norm. For example, even though not a lot of people watch anime, a few people really enjoy "Demon Slayer"; so when the movie came out, it became highly recommended to the general user base. Similarly, "Squid Game" became popular this way, along with Korean movies like "Parasite" and "Train to Busan". The feedback loop only stops when the system does something about it (like deliberately lowering the item's bias or no longer recommending it) or when the trend fades over time.
The bias for certain items in the latent factors may be due to representation bias. If you don't want your entire user base (and your system) to shift, then you have to be wary of these feedback loops. Once an item's bias becomes too high, more people of the same group come along, and your user base ends up being that group.
An easy way to prepare for feedback loops is to integrate your model slowly:
- first, have people monitor the model and its recommendations;
- then, monitor the recommendations over time; and
- eventually let the model recommend on its own.
Our first method isn't deep learning; it's called probabilistic matrix factorization: instead of a neural network, we used the dot product to calculate our predictions.
With deep learning, we need a neural network, which contains all the layers with parameters that we optimize in each epoch. So, we also don't need the same number of latent factors for items and users since we won't be using the dot product.
What you'll find is that deep learning on its own is a bit worse than probabilistic matrix factorization. But, you can add other user and item information, date and time information, and/or any other information that might be relevant to the recommendation.
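A minimal sketch of that deep learning version (the sizes are made up; note the user and movie embeddings have different widths, which the dot-product version couldn't allow):

```python
import torch
from torch import nn

class CollabNN(nn.Module):
    def __init__(self, n_users, n_movies, user_factors=40, movie_factors=20):
        super().__init__()
        self.user_emb = nn.Embedding(n_users, user_factors)
        self.movie_emb = nn.Embedding(n_movies, movie_factors)
        # The concatenated embeddings feed a small neural network.
        self.layers = nn.Sequential(
            nn.Linear(user_factors + movie_factors, 100),
            nn.ReLU(),
            nn.Linear(100, 1),
        )

    def forward(self, users, movies):
        x = torch.cat([self.user_emb(users), self.movie_emb(movies)], dim=1)
        # Squash into the 0-5.5 rating range, as before.
        return torch.sigmoid(self.layers(x)).squeeze(1) * 5.5

model = CollabNN(n_users=100, n_movies=50)
preds = model(torch.tensor([3, 7]), torch.tensor([7, 2]))
```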