Gradient descent with Adam
SGD, but with adaptive moms.
An optimizer is the function used in the "step" phase of training, where we update the weights based on the gradients of the loss computed during backpropagation. By default, fastai uses the Adam optimizer.
To compare Adam with previous optimizers, let's train a model using Adam with the Imagenette data set:
def get_dls(url, presize, resize):
    path = untar_data(url)
    Path.BASE_PATH = path
    return DataBlock(
        blocks=(ImageBlock, CategoryBlock), get_items=get_image_files,
        splitter=GrandparentSplitter(valid_name='val'),
        get_y=parent_label, item_tfms=Resize(presize),
        batch_tfms=[*aug_transforms(min_scale=0.5, size=resize),
                    Normalize.from_stats(*imagenet_stats)]
    ).dataloaders(path, bs=128)
dls = get_dls(URLs.IMAGENETTE_160, 160, 128)
def get_learner(**kwargs):
    return vision_learner(dls, resnet34, pretrained=False,
                          metrics=accuracy, **kwargs).to_fp16()
learn = get_learner()
learn.fit_one_cycle(3, 3e-3)
At the beginning of my fastai journey, I first learned stochastic gradient descent, which I used to train a model on the MNIST data set. SGD's update looks like:
new_weight = weight - lr * weight.grad
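As a minimal sketch (plain PyTorch rather than fastai's implementation, and assuming a model and lr already exist), that update applied to every parameter of a model might look like:
import torch

# minimal sketch of one SGD step over a model's parameters;
# `model` and `lr` are assumed to already exist
with torch.no_grad():
    for weight in model.parameters():
        if weight.grad is not None:
            weight -= lr * weight.grad   # new_weight = weight - lr * weight.grad
            weight.grad.zero_()          # reset the gradient for the next batch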
Let's train a model using SGD as its optimizer:
learn = get_learner(opt_func=SGD)
lr = learn.lr_find().valley
learn.fit_one_cycle(3, lr, moms=(0, 0, 0))
Unlike previous deep learning libraries, fastai was innovative in that it allows callbacks in basically any part of the training loop, letting users tweak training without having to change the actual library code.
A callback is a piece of code that you write and inject somewhere in another piece of code.
We can create different optimizers through optimizer callbacks, which let us change how the optimizer takes a step during the update phase of the training cycle.
Optimizer is a class provided by fastai that has two major methods, zero_grad and step:
def zero_grad(self):
    for p,*_ in self.all_params(with_grad=True):
        p.grad.detach_()
        p.grad.zero_()

def step(self, closure=None):
    if closure is not None: raise NotImplementedError("fastai optimizers currently do not support closure")
    for p,pg,state,hyper in self.all_params(with_grad=True):
        for cb in self.cbs: state = _update(state, cb(p, **{**state, **hyper}))
        self.state[p] = state
When we update the weights, we take a step using step and then zero the gradients using zero_grad, both of which are called automatically by fastai's Learner.
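The _update helper isn't shown here; judging purely from how it's called above (an assumption, not fastai's actual source), it presumably just merges whatever dictionary a callback returns into the state, along the lines of:
def _update(state, new=None):
    # sketch based on usage above, not fastai's actual code:
    # merge the dict returned by a callback into the state, ignoring None returns
    if new is None:
        return state
    return {**state, **new}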
We can initialize Optimizer with a list of callbacks cbs that can update state. state is a dictionary that lets us store values between steps, as if we could modify the code on the fly. Optimizer callbacks should ultimately update the parameters, shortened to p in the code, and they can use state to pass intermediate values between different callbacks.
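For example, here's a toy callback (hypothetical, not part of fastai) that just counts how many updates each parameter has received, keeping the counter in state:
def count_updates(p, n_updates=0, **kwargs):
    # hypothetical example callback: whatever dict it returns is merged into `state`,
    # so on the next step `n_updates` comes back in as a keyword argument
    return {'n_updates': n_updates + 1}
On its own this doesn't change anything; it would sit in cbs alongside a callback that actually updates p, like the SGD callback below.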
So, we can create our own SGD like so and get a similar result to before:
def sgd_cb(p, lr, **kwargs):
    p.data.add_(p.grad.data, alpha=-lr)
opt_func = partial(Optimizer, cbs=[sgd_cb])
learn = get_learner(opt_func=opt_func)
learn.fit_one_cycle(3, 1.5e-2, moms=(0, 0, 0))
If our loss function is really noisy, the gradients may be constantly changing direction, which makes training unstable. Instead, what if we "soften" the updates by forcing the gradients' direction to change more slowly?
Momentum, according to the Cambridge dictionary, is "the force that keeps an object moving". In terms of gradient descent, it's like continuing to descend in the previous direction for a while, rather than instantaneously changing direction. So, if we previously had:
new_weight = weight - lr * weight.grad
Then we can add momentum by incorporating the previous gradients:
weight.avg = beta * weight.avg + (1 - beta) * weight.grad
new_weight = weight - lr * weight.avg
Where beta is a parameter from 0 to 1 that controls how much momentum we want: as beta increases, we care less about the current gradient.
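To see the smoothing effect in isolation, here's a small sketch (plain PyTorch with made-up numbers) comparing a noisy gradient sequence with its exponentially weighted moving average:
import torch

torch.manual_seed(0)
beta = 0.9
grads = torch.randn(10)     # a made-up sequence of noisy gradients
avg = 0.0                   # the moving average starts at zero
for grad in grads:
    avg = beta * avg + (1 - beta) * grad.item()
    print(f"grad: {grad.item():+.2f}  avg: {avg:+.2f}")   # avg moves much more slowly than grad
Implemented as optimizer callbacks, momentum looks like: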
def average_grad(p, mom, grad_avg=None, **kwargs):
    # mom is short for momentum
    if grad_avg is None:
        grad_avg = torch.zeros_like(p.grad.data)
    return {'grad_avg': mom * grad_avg + (1 - mom) * p.grad.data}

def momentum_step(p, lr, grad_avg, **kwargs):
    p.data.add_(grad_avg, alpha=-lr)
opt_func = partial(Optimizer, cbs=[average_grad, momentum_step], mom=0.9)
learn = get_learner(opt_func=opt_func)
learn.fit_one_cycle(3, 3e-2)
Not all parameters are equal, so they won't all have the same gradients; some will be small, some will be large. So why should we give them all the same learning rate? Here come adaptive learning rates: we keep a single global learning rate, but use more of it for parameters whose gradients are smaller.
We can get adaptive learning rates by dividing the current gradient by an average of the previous gradients. But we can't just average the raw gradients, since similar positive and negative values would cancel out to roughly 0 (hence, leading to a divide by zero). Instead, we square the gradients, take the moving average, then take the square root.
weight.square_avg = alpha * weight.square_avg + (1 - alpha) * (weight.grad ** 2)
new_weight = weight - lr * weight.grad / math.sqrt(weight.square_avg + eps)
Now, we have two parameters: alpha and eps. alpha is like beta, and eps (usually a small number like 1e-8) is added for numerical stability.
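Here's a toy sketch (made-up numbers, not fastai code) showing how dividing by the root of the squared-gradient average evens out the step sizes of a small-gradient parameter and a large-gradient one:
import math

alpha, lr, eps = 0.99, 1e-3, 1e-8
for grad in (0.01, 10.0):             # a small-gradient and a large-gradient parameter
    square_avg = 0.0
    for _ in range(100):              # pretend the gradient stays the same for 100 steps
        square_avg = alpha * square_avg + (1 - alpha) * grad ** 2
    step = lr * grad / math.sqrt(square_avg + eps)
    print(f"grad {grad:>5}: step size {step:.5f}")   # both end up taking similar-sized steps
As an optimizer callback, the moving average of the squared gradients looks like: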
def average_square_grad(p, square_mom, square_avg=None, **kwargs):
    if square_avg is None:
        square_avg = torch.zeros_like(p.grad.data)
    return {'square_avg': square_mom * square_avg + (1 - square_mom) * p.grad.data ** 2}
This method is called RMSProp, since we're doing root-mean-square propagation, hence rms_prop_step:
def rms_prop_step(p, lr, square_avg, eps, grad_avg=None, **kwargs):
    denom = square_avg.sqrt().add_(eps)
    p.data.addcdiv_(p.grad, denom, value=-lr)
opt_func = partial(Optimizer, cbs=[average_square_grad, rms_prop_step],
                   square_mom=0.99, eps=1e-8)
learn = get_learner(opt_func=opt_func)
learn.fit_one_cycle(3, 3e-3)
You know, it would be nice if we could have both momentum for the gradients and adaptive learning rates. If only... oh, there's Adam, short for adaptive moment estimation, which combines the two: we take the moving average of the gradients for the step direction (in RMSProp we were just using the current gradient) and the moving average of the gradients squared for adaptive learning rates.
However, instead of taking the moving average of the gradients, we take the unbiased moving average, which is:
weight_avg = beta * weight_avg + (1 - beta) * weight.grad
unbiased_avg = weight_avg / (1 - beta ** (i + 1))
Where we divide the moving average by 1 - beta ** (i + 1), which gets closer to 1 as i, the index of the current iteration among the n batches in an epoch, approaches n. The unbiased moving average looks more like the actual gradients at the beginning (weight_avg is initialized to zeros, so without the correction the average would be mostly zero early on) and becomes closer to the plain moving average over time.
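A quick numeric sketch (with a made-up constant gradient) shows why this matters: without the correction, the average starts out far too small.
beta, grad = 0.9, 1.0        # pretend every batch produces the same gradient of 1.0
weight_avg = 0.0
for i in range(5):
    weight_avg = beta * weight_avg + (1 - beta) * grad
    unbiased_avg = weight_avg / (1 - beta ** (i + 1))
    print(f"step {i}: weight_avg={weight_avg:.3f}  unbiased_avg={unbiased_avg:.3f}")
# weight_avg starts at 0.100 and only slowly approaches 1.0,
# while unbiased_avg equals 1.0 from the very first step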
Overall, Adam looks like:
weight_avg = beta * weight_avg + (1 - beta) * weight.grad
unbiased_avg = weight_avg / (1 - beta ** (i + 1))
weight.square_avg = alpha * weight.square_avg + (1 - alpha) * (weight.grad ** 2)
new_weight = weight - lr * unbiased_avg / math.sqrt(weight.square_avg + eps)
The default for (beta, alpha) is (0.9, 0.99), and eps is 1e-8.
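As a sketch (not fastai's actual implementation), we could wire these equations into the callback framework by reusing average_grad and average_square_grad from above, plus two hypothetical callbacks: one to count steps for the bias correction and one to take the Adam step.
def count_step(p, step=0, **kwargs):
    # hypothetical callback: keep a per-parameter step counter in `state` for bias correction
    return {'step': step + 1}

def adam_step(p, lr, mom, eps, grad_avg, square_avg, step, **kwargs):
    # hypothetical callback: debias the gradient average, then take an RMSProp-style step with it
    unbiased_avg = grad_avg / (1 - mom ** step)
    denom = square_avg.sqrt() + eps
    p.data.addcdiv_(unbiased_avg, denom, value=-lr)

opt_func = partial(Optimizer,
                   cbs=[average_grad, average_square_grad, count_step, adam_step],
                   mom=0.9, square_mom=0.99, eps=1e-8)
The order of cbs matters: the averaging and counting callbacks must run before adam_step so that grad_avg, square_avg, and step are already in state.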
In the end, we get the combined benefits of momentum and adaptive learning rates to get:
learn = get_learner()
learn.fit_one_cycle(3, 3e-3)
Weight decay can be thought of in two ways. I mostly thought of it as adding the square of the weights (times wd) to the loss, which is the same (up to a constant factor folded into wd) as adding wd * weight to the gradient:
weight.grad += wd * weight
The other way is to decay each weight directly by lr * wd times itself when we update it:
new_weight = weight - lr * weight.grad - lr * wd * weight
The two are equivalent through the lens of vanilla SGD. However, once we introduce momentum, we can no longer think of it as adding wd * weight to weight.grad, since the decay term would get folded into the moving average and we'd end up with something like:
weight_avg = beta * weight_avg + (1 - beta) * (weight.grad + wd * weight)
Instead, we use the second method. With Adam, it would look like:
new_weight = weight - lr * unbiased_avg / math.sqrt(weight.square_avg + eps) - lr * wd * weight
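To make the distinction concrete, here are the two variants written as hypothetical optimizer callbacks in the style used above (fastai ships its own versions; these are just sketches):
def l2_reg_cb(p, wd, **kwargs):
    # hypothetical: L2 regularization; the decay term ends up inside the moving averages
    p.grad.data.add_(p.data, alpha=wd)

def weight_decay_cb(p, lr, wd, **kwargs):
    # hypothetical: true weight decay, applied directly to the weights
    # (new_weight = weight - lr * wd * weight)
    p.data.mul_(1 - lr * wd)
Either one would be placed before the step callback in cbs; as noted above, the second form is the one used with Adam.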
Although fastai uses Adam as its default optimizer, which you can verify with the summary we had in this blog, there is also a variety of other optimizers that fastai provides here. The main takeaways from this blog are momentum and adaptive learning rates.
Momentum allows us to train the model with more stability since we aren't instantaneously changing direction according to the new gradients. Instead, we slowly change direction through a moving average of the previous gradients.
With the same idea of moving averages, adaptive learning rates allow each parameter to be special and learn at its own (though forcefully accelerated) pace.
Combining these two ideas gives us Adam, which has been shown to be much more efficient than vanilla SGD and than using only one of the two ideas (SGD with momentum or RMSProp).