So what is deep learning? From what I've seen so far of the fast.ai course (chapters 1 to 7), deep learning is training a model with labelled data such that it is able to predict labels for unlabelled data in practice.

With a standard algorithm, you get inputs, you give them to the algorithm, and you get an output; you know the code in the algorithm (or at least, you should understand it). It's similar in a deep learning model: you know how the model is going to be trained (the choices you make up front, like the loss function, the architecture such as resnet18, the optimizer such as SGD, the number of epochs, etc.) and you can look into the parameters (what the model learned; e.g., in vision models you can look at the layers and see how the earlier layers learn things like edges and gradients while the later layers learn more complex things like dog faces).
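To make that list of choices concrete, here's a minimal sketch in plain PyTorch/torchvision (not the fastai API the course actually uses); the learning rate and epoch count are placeholder values I made up:

```python
import torch
from torch import nn
from torchvision.models import resnet18, ResNet18_Weights

# the architecture, already pre-trained on ImageNet
model = resnet18(weights=ResNet18_Weights.DEFAULT)
loss_func = nn.CrossEntropyLoss()                    # the loss function
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # the optimizer (placeholder lr)
epochs = 3                                           # number of epochs (placeholder)

# the parameters are what the model learns; you can poke at them layer by layer
for name, param in model.named_parameters():
    print(name, tuple(param.shape))
```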

The difference is that you didn't hard code the parameters - hopefully - like how you hard coded the algorithm (or copied it from a library or Stack Overflow). You have to train the model by giving it labelled data. Emphasis on labelled data. The model can't learn if it doesn't have anything to learn from. Why? Because you're creating a model to predict labels, so your data should be labelled. So, a lot of deep learning work ends up being just labelling your data correctly (or labelling your data at all).

Then, there's overfitting - an obstacle that prevents your model from being useful. Imagine this: your model predicts your training data with near 100% accuracy. Congrats! You should get an award! Oh, wait, your model can only predict the validation set with 90% accuracy. Oh, double wait, your model can only predict the test set with 50% accuracy... WHAT'S HAPPENING? Your model is memorizing (over-fitting) your training data instead of actually learning from it.

So, you should be careful when training your model. There are so many options you can tune when training a model. From what I've seen so far, you usually overfit by

  • training for too many epochs,
  • using a learning rate that's too large or too small,
  • having a poorly split dataset (like a random split on time series data),
  • making so many post-training decisions based on the validation set that you end up overfitting on it, and
  • not using a pre-trained model (if one exists for your purposes).

Remember, you want your model to be good at generalizing on previously unseen data... unless you have a dataset that's so large that there is no previously unseen data. It doesn't matter how well your model does on your training and validation sets if it doesn't work with the test set or a random piece of data you give it!

Then, What is a Deep Learning Model?

A deep learning model consists of layers. Remember how I mentioned resnet18? That's an architecture (a pre-trained one, too, which means it's already been trained on another dataset; we retrain it with our own dataset, which saves time and money since we start from a pretty high accuracy). An architecture is the skeleton of a deep learning model. It contains all the layers and parameters. resnet18, as you could guess, contains 18 layers.
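As a rough example, this is more or less the cats-vs-dogs snippet from chapter 1 of the course, written from memory and swapped to resnet18 to match what I said above (the course uses resnet34, and is_cat relies on that dataset's convention that cat images have capitalized filenames):

```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()  # label comes straight from the filename

dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))

learn = vision_learner(dls, resnet18, metrics=error_rate)  # pre-trained resnet18
learn.fine_tune(1)  # retrain it on our own data
```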

A layer is technically composed of two things. The first is a linear layer ($w\cdot x+b$, where $w$ are the weights, $x$ is the data or input, and $b$ is the bias; what people call the parameters of a model are the weights and biases), followed by a non-linearity, typically a ReLU (rectified linear unit, which turns all negative numbers into 0 and keeps positive numbers as is). So, you may be wondering: why do we need both? Well, if you only had linear layers, you could combine all of them into a single layer. So, we need some kind of nonlinearity to prevent the linear layers from collapsing into one big linear layer.
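Here's a tiny made-up illustration of that stacking in PyTorch; the layer sizes (784, 30, 10) are arbitrary numbers for the sketch:

```python
import torch
from torch import nn

# one "layer" as described above: linear (w·x + b) followed by a ReLU,
# then a final linear layer producing the outputs
model = nn.Sequential(
    nn.Linear(784, 30),  # weights and biases: the parameters
    nn.ReLU(),           # turns negatives into 0, keeps positives as is
    nn.Linear(30, 10),   # 10 outputs, e.g. one per label
)

x = torch.randn(1, 784)  # one fake input with 784 values
print(model(x).shape)    # torch.Size([1, 10])
```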

Now, let's say you have an image. We transform the data, for example by resizing and augmenting it (sort of like distorting it, but in a good way), such that it's $224\times 224$ pixels large. That's $50176$ pixels. Then, there could be $4$ channels for red, green, blue, and alpha (transparency). That becomes $200704$ input values for the first layer. With each layer, you want to decrease that number until the last layer, where you'll have $n$ outputs since you have $n$ different labels.
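A quick sanity check of that arithmetic, with a random tensor standing in for a real image:

```python
import torch

img = torch.rand(4, 224, 224)    # a made-up 4-channel (RGBA) 224x224 "image"
print(224 * 224)                 # 50176 pixels
print(img.numel())               # 200704 values once you count all 4 channels
print(img.flatten().shape)       # torch.Size([200704]), the input to the first layer
```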

Then, how does the model actually learn? Every time you pass a piece of data through and get a prediction, that prediction is passed to a function called the loss function (what you care about is the metric function; the computer cares about the loss function). Using that loss function, each parameter takes a step in the opposite direction of the slope of the loss function at that parameter's value, so that by the next prediction the value of the loss function is smaller. How big is that step? It's determined by the learning rate you chose. That's why it's important to have a good loss function and a reasonable learning rate.

And, you may be saying, "that's cool and all, but how does that step actually work?" Each number in a layer (the weights and biases) keeps track of what functions are applied to it. So, when you take a "step" (or optimize; hence, "optimizer" like SGD), you first work out the derivative of those composed functions through the chain rule and take its value at the current parameter (so if the derivative is $f'$ and the number is $x$, the value would be $f'(x)$); that value is the gradient. Then, with the learning rate, you take a step in the opposite direction such that you descend (parameters -= gradient * learning_rate). Now you can guess why the process is called "stochastic gradient descent." You descend the loss function (which should be easily differentiable and typically have non-zero gradients) based on the gradient, and the "stochastic" part comes from computing that gradient on randomly chosen batches of data rather than on the whole dataset at once.
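Here's a minimal sketch of a single step, using PyTorch's autograd to handle the chain rule and a made-up quadratic standing in for a real loss function (the 0.1 learning rate is arbitrary):

```python
import torch

params = torch.randn(3, requires_grad=True)  # randomly initialized parameters
lr = 0.1                                     # learning rate (arbitrary)

loss = (params ** 2).sum()   # stand-in for a real loss function
loss.backward()              # autograd applies the chain rule to get the gradient

with torch.no_grad():
    params -= params.grad * lr   # step in the opposite direction of the gradient
    params.grad.zero_()          # reset the gradient before the next step
```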

With each epoch (which is one complete pass through the data), the parameters are fine-tuned for your task; train long enough and they eventually end up memorizing the training set. You don't want that, so you train for only as many epochs as you need (to prevent overfitting and to save time).

Remember when I said "you care about the metric function?" That's what you want to pay attention to after each epoch. Metric functions are typically things like accuracy or error rate (which is 1 - accuracy), which are helpful for us to analyze our model, but terrible for our model to use as loss functions: a small change in the parameters usually doesn't change the accuracy at all, so the gradient is zero almost everywhere and the optimizer has nothing to descend.
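A small made-up example of the difference, with invented predictions and labels: the cross-entropy loss is a smooth number the optimizer can descend, while accuracy just counts how many predictions were right:

```python
import torch
from torch import nn

# invented predictions (logits) for 4 examples over 2 classes, plus the true labels
preds = torch.tensor([[2.0, 1.0], [0.5, 1.5], [3.0, 0.0], [1.0, 1.2]])
targets = torch.tensor([0, 1, 0, 0])

loss = nn.CrossEntropyLoss()(preds, targets)                 # smooth: the model descends this
accuracy = (preds.argmax(dim=1) == targets).float().mean()   # what we actually care about

print(loss.item(), accuracy.item())  # accuracy here is 0.75 (3 out of 4 right)
```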

With an overfitting model, you usually see two things: the loss for the training set continues to decrease, but the loss for the validation set starts increasing. The opposite happens with a metric like accuracy: it keeps improving on the training set while it gets worse on the validation set. That's why you typically want a test set. A training set is the set your model sees when it's training. A validation set is for your eyes only and is used to check how well the model is learning. A test set is for no one's eyes. Only for God's eyes or whatever you believe in. Once you've finished training your model and tuned all the hyperparameters you want, the test set will tell you how well your model actually trained.

Of course, the validation set and training set are going to be useless if you split your data badly. The common example is with a time series dataset. If you split your data randomly, the model can easily predict intermediate values, but why would you want to know about the past? You want to predict the future. So, you could have your training set consist of data not containing the most recent 6 months, for example. And your validation set consists of the remaining 6 months. You could even take the most recent month out and put the first 5 months into your validation set instead and the most recent month into the test set. Whenever you split your data, make sure it's with the intention that the model will be learning the patterns of the data such that it can generalize to future, never-seen-before data.
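For example, here's a rough sketch of that kind of time-based split with pandas, on invented daily data (the column names and the two-year range are just placeholders):

```python
import numpy as np
import pandas as pd

# invented daily data for roughly two years, oldest to newest
df = pd.DataFrame({
    "date": pd.date_range("2022-01-01", periods=730, freq="D"),
    "sales": np.random.rand(730),
})

cutoff = df["date"].max() - pd.DateOffset(months=6)
train_df = df[df["date"] <= cutoff]  # everything except the most recent 6 months
valid_df = df[df["date"] > cutoff]   # the most recent 6 months, for validation
print(len(train_df), len(valid_df))
```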

That's it?

Well, that's basically what I've gotten so far from fast.ai. Following Arthur Samuel's definition of machine learning, you essentially have a model: you give it labelled data as input, it produces predictions, those predictions are passed to a loss function, then you optimize the parameters and repeat for $n$ many epochs until you reach a certain point and stop training the model.
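Putting that together, here's a toy, fully runnable version of that loop on random made-up data; the model, sizes, learning rate, and epoch count are all arbitrary choices for the sketch:

```python
import torch
from torch import nn

x = torch.randn(100, 10)         # 100 made-up labelled examples, 10 features each
y = torch.randint(0, 2, (100,))  # their (random) labels, 2 classes

model = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 2))
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):           # n epochs
    preds = model(x)             # predictions
    loss = loss_func(preds, y)   # how wrong were we?
    optimizer.zero_grad()
    loss.backward()              # gradients via the chain rule
    optimizer.step()             # nudge the parameters
    print(epoch, loss.item())
```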

There are many parts of a model that you can change, and some that you have to change depending on the task, particularly the loss function.

I've mainly been focusing on computer vision since that's what the first 7 chapters were about (apart from chapter 1; I'm also leaving chapter 3, which is about ethics, for last).

Next time you see me, I'll be talking about tabular data.