In a previous blog, we used a pretrained model built on the AWD-LSTM architecture, which is itself built on a recurrent neural network. "Recurrent", according to the Cambridge Dictionary, means "happening again many times". And it just so happens that a recurrent neural network is a neural network with layers that happen again (repeat) many times.

To go over RNNs in this blog, we'll be using the human numbers data set that contains the first 10,000 numbers written out in English.

We'll download the data set from fastai's URLs class:

from fastai.text.all import *
path = untar_data(URLs.HUMAN_NUMBERS)

Then, we'll see how the data set is laid out:

path.ls()
(#2) [Path('train.txt'),Path('valid.txt')]

There are two text files that contain the numbers. Since we're creating a language model, we'll concatenate them:

lines = L()
with open(path/'train.txt') as f: 
    lines += L(f.readlines())
with open(path/'valid.txt') as f:
    lines += L(f.readlines())
lines
(#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

Then, we can join the lines together, separated by dots so that we can tokenize them:

text = ' . '.join([i.strip() for i in lines])
text[:100]
'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

We'll then tokenize them by splitting them according to spaces:

tokens = text.split(' ')
tokens[:10]
['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

We first joined the lines with a period (rather than just spaces) because the spaces between words are significant: we want to separate numbers, not words. That is, we want this:

text[text.rindex('.'):]
'. nine thousand nine hundred ninety nine'

Not this:

nine . thousand . nine . hundred . ninety . nine

Next, we'll create our vocab by making a list of the unique tokens:

vocab = L(tokens).unique()
vocab
(#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

And then we'll numericalize the tokens. In this blog, we'll use the naming convention input_target, read as "input" to (_) "target", so t_i means token to index:

t_i  = {t: i for i, t in enumerate(vocab)}
nums = L(t_i[t] for t in tokens)
nums[:10]
(#10) [0,1,2,1,3,1,4,1,5,1]

We want our model to predict the next word given the previous 3 words in the sequence, and we can build those sequences with plain Python:

seqs_tok = L((tokens[i:i+3], tokens[i+3]) for i in range(0, len(tokens)-4, 3))
seqs_tok
(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Since that looks right with the tokens, we'll do the same with the numericalized tokens (it should look like the above, but numericalized):

seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0, len(nums)-4, 3))
seqs
(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10,  1, 11]), 1),(tensor([ 1, 12,  1]), 13),(tensor([13,  1, 14]), 1),(tensor([ 1, 15,  1]), 16)...]

Then, we'll make our DataLoaders:

cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=64, shuffle=False)

And check that we can make a batch:

dls.one_batch()

(tensor([[ 0,  1,  2],
         [ 1,  3,  1],
         [ 4,  1,  5],
         [ 1,  6,  1],
         [ 7,  1,  8],
         [ 1,  9,  1],
         [10,  1, 11],
         [ 1, 12,  1],
         [13,  1, 14],
         [ 1, 15,  1],
         [16,  1, 17],
         [ 1, 18,  1],
         [19,  1, 20],
         [ 1, 20,  0],
         [ 1, 20,  2],
         [ 1, 20,  3],
         [ 1, 20,  4],
         [ 1, 20,  5],
         [ 1, 20,  6],
         [ 1, 20,  7],
         [ 1, 20,  8],
         [ 1, 20,  9],
         [ 1, 21,  1],
         [21,  0,  1],
         [21,  2,  1],
         [21,  3,  1],
         [21,  4,  1],
         [21,  5,  1],
         [21,  6,  1],
         [21,  7,  1],
         [21,  8,  1],
         [21,  9,  1],
         [22,  1, 22],
         [ 0,  1, 22],
         [ 2,  1, 22],
         [ 3,  1, 22],
         [ 4,  1, 22],
         [ 5,  1, 22],
         [ 6,  1, 22],
         [ 7,  1, 22],
         [ 8,  1, 22],
         [ 9,  1, 23],
         [ 1, 23,  0],
         [ 1, 23,  2],
         [ 1, 23,  3],
         [ 1, 23,  4],
         [ 1, 23,  5],
         [ 1, 23,  6],
         [ 1, 23,  7],
         [ 1, 23,  8],
         [ 1, 23,  9],
         [ 1, 24,  1],
         [24,  0,  1],
         [24,  2,  1],
         [24,  3,  1],
         [24,  4,  1],
         [24,  5,  1],
         [24,  6,  1],
         [24,  7,  1],
         [24,  8,  1],
         [24,  9,  1],
         [25,  1, 25],
         [ 0,  1, 25],
         [ 2,  1, 25]]),
 tensor([ 1,  4,  1,  7,  1, 10,  1, 13,  1, 16,  1, 19,  1,  1,  1,  1,  1,  1,
          1,  1,  1,  1, 21, 21, 21, 21, 21, 21, 21, 21, 21, 22,  0,  2,  3,  4,
          5,  6,  7,  8,  9,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 24, 24, 24,
         24, 24, 24, 24, 24, 24, 25,  0,  2,  3]))

For a model that takes in 3 words as input and tries to predict the next word, we can:

  1. Calculate the embeddings for the first word,
  2. Pass the embeddings into a linear layer,
  3. Apply a nonlinearity (like ReLU),
  4. Calculate the embeddings for the second word,
  5. Add the embeddings to the activations from step 3,
  6. Pass the activations into the same linear layer in step 2,
  7. Apply a nonlinearity, and
  8. Repeat steps 4 to 7 with the third word.

By adding the next word's embeddings to the previous activations, every word is interpreted in the context of the preceding words. And, we use the same weight matrix (linear layer) since the way one word influences the activations from the previous words shouldn't change depending on the position of the word; so, we force the layer to learn all positions instead of limiting each layer to one position.

To turn this idea into code, we can make a model by inheriting from PyTorch's Module class:

class RNNish(Module):
    """
    We have three different states:
      -  Input  (the words)
      -  Hidden (activations)
      -  Output (the probabilities for the next word)
    
    We then have three different layers:
      -  i_h: input to hidden
           -  The embedding matrix to turn our words into embeddings
      -  h_h: hidden to hidden
           -  Calculates the activations for the next word
      -  h_o: hidden to output
           -  Calculates the predictions for the next word
    """
    def __init__(self, n_vocab, n_hidden):
        self.i_h = nn.Embedding(n_vocab, n_hidden)
        # If we want a more complex model, 
        # we would be altering this
        # hidden to hidden layer into more layers
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, n_vocab)

    # This is what our steps would look like
    def forward(self, x):
        h = self.i_h(x[:,0])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,1]) 
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2]) 
        h = F.relu(self.h_h(h))
        return self.h_o(h)

If you took an intro CS course, you might remember a point before you'd learned about while loops when you were copy-pasting if statements and changing a couple of numbers here and there. Well, we know loops, so we can turn our repetitive "calculate the next embeddings, add them to the hidden state, then calculate the next activations" into a loop.

A hidden state is the set of activations that are updated at each step of a recurrent neural network (which we can see below in the for loop):

class RNN(Module):
    def __init__(self, n_vocab, n_hidden):
        self.i_h = nn.Embedding(n_vocab, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, n_vocab)

    # This is how we can simplify to turn it
    # into a recurrent (looped) neural network
    def forward(self, x):
        # We can set it to 0 because tensors have
        # a thing called "broadcasting" that tries
        # to expand the smaller shape tensor into
        # the same shape as the other one
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

So, a recurrent neural network is a neural network that's defined using a loop, hence "recurrent". A version written without the loop, like RNNish, is called the unrolled representation of an RNN.
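
As a quick sanity check (a sketch, not something the training below depends on), we can copy the weights from one model into the other and confirm the two produce the same output on a batch:

x, y = dls.one_batch()
unrolled, looped = RNNish(len(vocab), 64), RNN(len(vocab), 64)
looped.load_state_dict(unrolled.state_dict())  # give both models identical weights
torch.allclose(unrolled(x), looped(x))         # True: same computation, written two ways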

When we train a model with these two architectures, we should have about the same accuracy:

learn = Learner(dls, RNNish(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.794642 1.953518 0.473497 00:02
1 1.401166 1.721939 0.475398 00:02
2 1.425844 1.658773 0.492750 00:02
3 1.379972 1.654874 0.490373 00:02
learn = Learner(dls, RNN(len(vocab), 64), loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)
epoch train_loss valid_loss accuracy time
0 1.852343 1.971172 0.464226 00:02
1 1.371547 1.797724 0.475160 00:02
2 1.413824 1.689967 0.491324 00:02
3 1.366402 1.645036 0.492988 00:02

And, we get about 49% for each. To see whether that's actually good, we can compare it to a baseline that always predicts the most commonly occurring token:

n, counts = 0, torch.zeros(len(vocab))
for x, y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab):
        counts[i] += (y == i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n
(tensor(29), 'thousand', 0.15165200855716662)

So, if we just predicted the most commonly occurring token, "thousand", each time, we would have an accuracy of 15%, so our basic language model with an accuracy of 49% is much better.

You might be wondering: why not just use h += ... instead of h = h + ...? I thought so too, but you get a RuntimeError from PyTorch because += modifies the tensor in place, and autograd can't compute gradients through an in-place change to a tensor it still needs for the backward pass. You can read more on why here.
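
As a minimal sketch of the kind of failure you'd hit (not the exact code from our model): ReLU saves its output for the backward pass, so modifying that output in place breaks the gradient computation.

x = torch.randn(3, requires_grad=True)
h = F.relu(x)       # ReLU saves its output for the backward pass
h += 1              # in-place op modifies that saved tensor
h.sum().backward()  # raises a RuntimeError about an inplace operation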

Improving the simple RNN

Currently, we're initializing the hidden state h to 0 each time in forward. This effectively makes the model forget what it's seen before. To fix this, we can initialize h in __init__ and create a reset function to reinitialize h to 0.

But, this creates another problem: as we apply another layer to h, we add another thing on which we have to calculate the derivative during backpropagation. So, we can use PyTorch's detach method on h, which removes the gradient history of h (technically, it makes h no longer require gradient).
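
For a quick picture of what detach does, here's a minimal sketch: the detached tensor keeps the same values but is cut off from the computation graph.

t = torch.ones(3, requires_grad=True)
h = t * 2
h.requires_grad   # True: h is part of the computation graph
h = h.detach()
h.requires_grad   # False: same values, but no gradient history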

Overall, we made our new RNN stateful since it remembers its activations between batches (between different calls to forward):

class RNN2(Module):
    def __init__(self, n_vocab, n_hidden):
        self.i_h = nn.Embedding(n_vocab, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, n_vocab)
        self.h   = 0

    def forward(self, x):
        for i in range(3):
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
        out    = self.h_o(self.h)
        self.h = self.h.detach()
        return out
    
    def reset(self):
        self.h = 0

We can also use any sequence length we want since the hidden state carries over from one batch to the next. The only difference is that we only calculate gradients over the tokens in the current sequence instead of the whole history. This approach is called truncated backpropagation through time (truncated BPTT).

BPTT means treating an RNN as one big model (which we did by initializing h in __init__ and keeping it between batches), and calculating gradients on it the usual way. Truncated BPTT avoids running out of memory and time by "detaching" the history of computation steps in the hidden state every batch (which we did by calling detach in forward), while still resetting h to 0 at the start of each epoch (in reset).

To make our stateful model work, the batches need to be arranged so that each row reads the data set in order: dset[0] should be the first line of the first batch, dset[1] the first line of the second batch, and so on. To fill the other rows, we split the data set into bs chunks of size m = len(dset) // bs, so that dset[i + j * m] ends up as the (j+1)-th line of the (i+1)-th batch. This is done automatically in LMDataLoader.

The following function does the reindexing:

def group_chunks(dset, bs):
    m = len(dset) // bs
    new_dset = L()
    for i in range(m):
        new_dset += L(dset[i + j * m] for j in range(bs))
    return new_dset
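
To see what that reindexing does, here's a toy example (a hypothetical data set of the numbers 0 to 9 with bs=2, so m=5):

group_chunks(L(range(10)), 2)
(#10) [0,5,1,6,2,7,3,8,4,9]

With a batch size of 2, the first batch would then contain items 0 and 5, the second batch items 1 and 6, and so on: the first row reads 0 through 4 across batches while the second row reads 5 through 9.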

Then, when we make our DataLoaders, we also need to drop the last batch since it might not be of size bs. We also need to avoid shuffling the data since that would ruin the purpose of our reindexing.

bs  = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(
        group_chunks(seqs[:cut], bs), 
        group_chunks(seqs[cut:], bs), 
        bs=bs, drop_last=True, shuffle=False)

Finally, we need to adjust the training loop so that reset gets called. We can do this by adding ModelResetter as a Callback (cbs), which calls reset before each epoch and each validation phase. Since we start each epoch (and the validation phase) with a clean hidden state, we can also train for more epochs.

learn = Learner(dls, RNN2(len(vocab), 64), loss_func=F.cross_entropy, 
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(10, 3e-3)
epoch train_loss valid_loss accuracy time
0 1.739974 1.857636 0.474038 00:01
1 1.270585 1.779141 0.451683 00:01
2 1.106414 1.576123 0.522356 00:01
3 1.020932 1.581516 0.552644 00:01
4 0.954357 1.765170 0.551683 00:01
5 0.894364 1.761537 0.568510 00:01
6 0.850844 1.735529 0.553365 00:01
7 0.813397 1.583861 0.581010 00:01
8 0.766258 1.656481 0.605529 00:01
9 0.751839 1.691648 0.609135 00:01

Making it more like a language model

Remember how when we took the movie review data set, we made the independent variable and dependent variable the same token length, but the dependent variable was ahead by one token? By doing so, we get more signal that we can feed back to the model when we update the weights. Why predict the last word of the sequence when you can predict the next word for each word in the sequence, right?

So, we can adjust our seqs to be of sl length for both independent and dependent variables, with them offset by one token:

sl      = 16
seqs_lm = L((tensor(nums[i:i+sl]), tensor(nums[i+1:i+1+sl]))
            for i in range(0, len(nums)-1-sl, sl))
[L(vocab[j] for j in seq) for seq in seqs_lm[0]]
[(#16) ['one','.','two','.','three','.','four','.','five','.'...],
 (#16) ['.','two','.','three','.','four','.','five','.','six'...]]

Then we can make our DataLoaders the same way as before:

bs  = 64
cut = int(len(seqs_lm) * 0.8)
dls = DataLoaders.from_dsets(
        group_chunks(seqs_lm[:cut], bs), 
        group_chunks(seqs_lm[cut:], bs), 
        bs=bs, drop_last=True, shuffle=False)

But, we need to change our model so that it predicts after every word and not after the last one:

class RNN3(Module):
    def __init__(self, n_vocab, n_hidden):
        self.i_h = nn.Embedding(n_vocab, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, n_vocab)
        self.h   = 0

    def forward(self, x):
        outs = []
        # We changed 3 to sl since we'll be
        # predicting the next word sl times
        for i in range(sl): 
            self.h = self.h + self.i_h(x[:,i])
            self.h = F.relu(self.h_h(self.h))
            outs.append(self.h_o(self.h))
        self.h = self.h.detach()
        return torch.stack(outs, dim=1)
    
    def reset(self):
        self.h = 0

And, we'll have to flatten the outputs and targets before using them in F.cross_entropy. The output of the model has shape bs $\times$ sl $\times$ n_vocab since we stacked the per-step outputs along a new dimension (dim=1), and our targets have shape bs $\times$ sl. So, we can reshape them with the view method:

def loss_func(input, target):
    # .view(-1, len(vocab)) means make len(vocab)
    # columns with as many rows as needed (-1)
    #
    # .view(-1) means flatten the entire tensor
    # into one row that's as long as it needs to be
    return F.cross_entropy(input.view(-1, len(vocab)), target.view(-1))
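
To make the shapes concrete, here's a small sketch with simulated tensors (using the bs, sl, and vocab sizes from above):

out = torch.randn(bs, sl, len(vocab))         # simulated model output: (64, 16, 30)
tgt = torch.randint(0, len(vocab), (bs, sl))  # simulated targets: (64, 16)
out.view(-1, len(vocab)).shape, tgt.view(-1).shape
(torch.Size([1024, 30]), torch.Size([1024]))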

Finally, we can train our model. We'll train for more epochs than last time since the task is now more involved (predicting the next word at every position):

learn = Learner(dls, RNN3(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.321696 3.182977 0.186442 00:01
1 2.411823 1.992369 0.469482 00:01
2 1.792959 1.769369 0.451253 00:01
3 1.494330 1.655577 0.501872 00:01
4 1.304660 1.634182 0.549967 00:01
5 1.152728 1.675661 0.584391 00:01
6 1.021364 1.745027 0.623698 00:01
7 0.910983 1.595549 0.588135 00:01
8 0.828727 1.685101 0.628092 00:01
9 0.753475 1.665891 0.615479 00:01
10 0.709412 1.602229 0.629069 00:01
11 0.660277 1.695538 0.648031 00:01
12 0.638024 1.641410 0.637695 00:01
13 0.617738 1.700257 0.645426 00:01
14 0.601591 1.764060 0.651204 00:01

We got a better accuracy, but we effectively have a very deep network. So, we can end up with very small or very large gradients, which can lead to very different results when we train the model again:

learn = Learner(dls, RNN3(len(vocab), 64), loss_func=loss_func,
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.209394 3.080578 0.263021 00:01
1 2.302940 1.904129 0.468262 00:01
2 1.727188 1.796725 0.468994 00:01
3 1.426145 1.770710 0.494548 00:01
4 1.230578 1.784142 0.498291 00:01
5 1.093605 1.889311 0.513591 00:01
6 0.975858 1.930008 0.532389 00:01
7 0.885223 2.018722 0.535319 00:01
8 0.814634 2.065933 0.538656 00:01
9 0.757224 2.108545 0.561686 00:01
10 0.712159 2.211724 0.544678 00:01
11 0.677356 2.263988 0.575033 00:01
12 0.650655 2.421612 0.533610 00:01
13 0.626848 2.389696 0.546549 00:01
14 0.614016 2.390650 0.552246 00:01

Retraining the model cost us nearly 10 percentage points of accuracy. One way to address this would be to try a deeper model: one with more than one linear layer between the hidden state and the output activations.

More layers. MORE LAYERS.

A multilayer RNN is really several RNNs stacked on top of each other: we pass the activations from one RNN as the inputs to the next RNN.

This time, instead of creating a for loop, we can use PyTorch's nn.RNN class, which implements it for us while also letting us choose how many layers we want:

class RNN4(Module):
    def __init__(self, n_vocab, n_hidden, n_layers):
        self.i_h = nn.Embedding(n_vocab, n_hidden)
        # The inputs we pass to the RNN (after the embedding) have shape
        # (bs, sl, n_hidden), so we tell PyTorch the batch dimension comes first
        self.rnn = nn.RNN(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o = nn.Linear(n_hidden, n_vocab)
        self.h   = torch.zeros(n_layers, bs, n_hidden)

    def forward(self, x):
        acts, h = self.rnn(self.i_h(x), self.h)
        self.h  = h.detach()
        return self.h_o(acts)

    def reset(self):
        self.h = self.h.zero_()

But, when we train our model, we get a worse accuracy than our previous single-layer RNN:

learn = Learner(dls, RNN4(len(vocab), 64, 2), 
                # CrossEntropyLossFlat() does the 
                # same thing as our loss_func 
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 3.089907 2.664662 0.444417 00:01
1 2.179285 1.815480 0.471354 00:01
2 1.719844 1.864228 0.323079 00:01
3 1.518661 1.860099 0.433594 00:01
4 1.405865 1.868557 0.475911 00:01
5 1.304332 1.898743 0.479329 00:01
6 1.194302 2.151057 0.470785 00:01
7 1.086549 2.219379 0.503988 00:01
8 0.950983 2.215469 0.501709 00:01
9 0.839067 2.280946 0.510661 00:01
10 0.756243 2.429555 0.513021 00:01
11 0.699213 2.520038 0.525065 00:01
12 0.661544 2.569467 0.520671 00:01
13 0.639416 2.584662 0.521973 00:01
14 0.628166 2.575517 0.524089 00:01

Even when we add more layers, we get a worse accuracy than our single-layer RNN:

learn = Learner(dls, RNN4(len(vocab), 64, 5), 
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 3e-3)
epoch train_loss valid_loss accuracy time
0 2.996651 2.546187 0.407471 00:02
1 2.138067 1.668146 0.470866 00:01
2 1.629905 1.733835 0.497965 00:01
3 1.352087 1.925911 0.537679 00:01
4 1.158306 1.992638 0.541992 00:01
5 0.996187 2.063406 0.544922 00:01
6 0.872516 2.011257 0.568848 00:01
7 0.745992 1.745594 0.566569 00:01
8 0.609336 1.654734 0.593099 00:01
9 0.494758 1.707728 0.603271 00:01
10 0.406440 1.617404 0.607747 00:01
11 0.343562 1.717202 0.604167 00:01
12 0.304646 1.746353 0.599854 00:01
13 0.281399 1.724579 0.605794 00:01
14 0.268294 1.720216 0.603923 00:01

The reason is that we now have an even deeper model, which is even more likely to lead to exploding or vanishing activations.

In practice, creating accurate models from multilayer RNNs is difficult because we're applying repeated matrix multiplication many, many times (each layer is another set of matrix multiplications). Multiplying over and over by a number even a little greater than 1 will lead to exploding activations; multiplying over and over by a number even a little smaller than 1 will lead to vanishing activations.
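
As a quick back-of-the-envelope illustration, here's what repeated multiplication does over, say, 1,000 steps:

1.01 ** 1000   # ≈ 2.1e4  (explodes)
0.99 ** 1000   # ≈ 4.3e-5 (vanishes)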

We also have the problem of floating point numbers. Because of how they're stored on the computer, the numbers are more accurate the closer they are to 0. This inaccuracy leads to the vanishing gradients or exploding gradients problem, where in SGD, the weights are either not updated at all, or explode to infinity.

For RNNs, there are two kinds of layers that are commonly used to avoid exploding activations: gated recurrent units (GRUs) and long short-term memory (LSTM).

Long Short-Term Memory (LSTM)

LSTM introduces another hidden state called the cell state that retains important information that happened earlier in the sentence (e.g., the subject's gender to predict "he/she/they"), that is, long short-term memory. The other hidden state is then like the sensory short-term memory.

human memory
How human memory is theorized to work. Thank you IB psychology.

So, LSTM looks like this:

lstm
Diagram of LSTM. From left to right on the top, we have the forget gate, the input gate, the cell gate, and the output gate.

In essence, the blue box is our forward function, which uses the previous hidden states $h_{t-1}$ and $c_{t-1}$ and accepts an input batch $x_t$. The function updates the hidden states to yield $h_t$ and $c_t$, which become $h_{(t+1)-1}$ and $c_{(t+1)-1}$ for the next time step.

In LSTM, the hidden state $h_{t-1}$ and the input batch $x_t$ are concatenated (instead of added, as we've been doing so far) to create a tensor whose size is the size of $h_{t-1}$ plus the size of $x_t$. So, all four gates take an input of that combined size and produce an output the size of $h_{t-1}$.

LSTM has four layers called gates. There are two different activation functions used in LSTM: sigmoid (which squishes values to between 0 and 1) and tanh (which squishes values to between -1 and 1). From left to right:

  • (1) Forget gate $f_t$: take what you currently know ($h_{t-1}$) and apply that to the input ($x_t$) to forget unimportant things in the cell state $c_{t-1}$.
  • (2) Input gate $i_t$ and (3) cell gate $g_t$: these two gates work together, so I'll group them together and call them the remember gate. Basically, take what you currently know ($h_{t-1}$) and apply that to the input ($x_t$) to remember the important stuff from the cell gate $g_t$. Add the output from the remember gate to the cell state.
  • (4) Output gate $o_t$: take important things from the new cell state that we might need for the next time step $t$.

The "importance" mentioned above is what the gates learn as we train the model.

The cell state $c_t$ is able to remember things much better (maintain a longer-term state) than the hidden state $h_t$ since it's only updated with elementwise multiplications and additions rather than being pushed through a linear layer at every step, which helps it avoid vanishing and exploding activations.

In code:

class LSTM(Module):
    def __init__(self, n_in, n_hid):
        n_cat            = n_in + n_hid
        self.forget_gate = nn.Linear(n_cat, n_hid)
        self.input_gate  = nn.Linear(n_cat, n_hid)
        self.cell_gate   = nn.Linear(n_cat, n_hid)
        self.output_gate = nn.Linear(n_cat, n_hid)
    
    def forward(self, x, state):
        h, c = state
        h    = torch.cat([h, x], dim=1)
        f    = torch.sigmoid(self.forget_gate(h))
        c    = c * f
        i    = torch.sigmoid(self.input_gate(h))
        g    = torch.tanh(self.cell_gate(h))
        c    = c + i * g
        o    = torch.sigmoid(self.output_gate(h))
        h    = o * torch.tanh(c)
        return h, (h, c)

However, in practice, we refactor the code since it's inefficient to do four small matrix multiplications when we could do one big multiplication on the GPU in parallel. It's like typing with a single finger when you were given 10 (unless you're missing fingers). Also, since it takes time to concatenate the input $x_t$ and the hidden state $h_{t-1}$, we use two layers instead: one for the input and one for the hidden state. So:

class LSTM(Module):
    def __init__(self, n_in, n_hid):
        self.i_h = nn.Linear(n_in,  4 * n_hid)
        self.h_h = nn.Linear(n_hid, 4 * n_hid)

    def forward(self, x, state):
        h, c    = state
        # .chunk(4, 1) splits the tensor into 4 tensors
        # along dimension 1 (the columns)
        gates   = (self.i_h(x) + self.h_h(h)).chunk(4, 1)
        # It doesn't matter what order the gates are 
        # as long as we keep the order throughout
        f, i, o = map(torch.sigmoid, gates[:3])
        g       = gates[3].tanh()

        c = c * f + i * g
        h = o * c.tanh()
        return h, (h, c)
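
As a quick shape check (a sketch with made-up sizes), the refactored cell takes a batch of inputs plus the (h, c) state and returns outputs of size n_hid:

cell = LSTM(n_in=64, n_hid=64)
x    = torch.randn(32, 64)        # a batch of 32 inputs
h    = torch.zeros(32, 64)        # initial hidden state
c    = torch.zeros(32, 64)        # initial cell state
out, (h, c) = cell(x, (h, c))
out.shape, h.shape, c.shape       # all torch.Size([32, 64])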

And, our LSTM cell is essentially what PyTorch already provides through nn.LSTM. So, we can use it to recreate our multilayer RNN:

bs = 64
class LSTM(Module):
    def __init__(self, n_vocab, n_hidden, n_layers):
        self.i_h   = nn.Embedding(n_vocab, n_hidden)
        self.rnn   = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o   = nn.Linear(n_hidden, n_vocab)
        # We have two hidden states (h, c) that we'll keep together in state
        self.state = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        h, state   = self.rnn(self.i_h(x), self.state)
        self.state = [s.detach() for s in state] 
        return self.h_o(h)

    def reset(self):
        for s in self.state: s.zero_()
learn = Learner(dls, LSTM(len(vocab), 64, 2), 
                loss_func=CrossEntropyLossFlat(),
                metrics=accuracy, cbs=ModelResetter)
learn.fit_one_cycle(15, 1e-2)
epoch train_loss valid_loss accuracy time
0 3.056956 2.708244 0.341553 00:01
1 2.185629 1.798441 0.377360 00:01
2 1.627687 1.897092 0.459717 00:01
3 1.285119 2.024220 0.486979 00:01
4 1.003079 1.560079 0.567301 00:01
5 0.718694 1.565812 0.619303 00:01
6 0.469903 1.525995 0.648519 00:01
7 0.306789 1.473051 0.684814 00:01
8 0.199153 1.369004 0.728516 00:01
9 0.122797 1.450004 0.731364 00:01
10 0.075972 1.254726 0.758545 00:01
11 0.048235 1.323799 0.759684 00:01
12 0.033595 1.288537 0.757080 00:01
13 0.026236 1.295729 0.759359 00:01
14 0.022860 1.290811 0.757406 00:01

We've reduced the chances of vanishing or exploding gradients, but now we have a bit of overfitting. Since there aren't many data augmentation techniques for text (beyond things like translating to another language and then back to the original), we can instead apply regularization techniques like dropout, activation regularization, and temporal activation regularization.

Averaged SGD (ASGD) Weight-Dropped LSTM (AWD-LSTM)

For AWD-LSTM, we need to add four (arguably five) things:

  1. Dropout: randomly (through a Bernoulli trial) remove some activations with probability $p$.
  2. Activation regularization: weight decay, but with activations instead of weights.
  3. Temporal activation regularization: activation regularization, but with the difference between two consecutive activations.
  4. Weight tying: tying the hidden to output weights with the input to hidden weights.
  5. (You also use non-monotonically triggered averaged stochastic gradient descent (NT-ASGD) as the optimizer).

Dropout is where you randomly set some of the activations to zero during training to make sure all parameters are being useful in producing the output:

Dropout image
A neural network with 2 hidden layers. (a) Before dropout. (b) After dropout.

But, we can't just zero some activations without doing anything else, since the outputs won't have the same scale: the sum of 5 activations isn't comparable to the sum of 2 activations. So, if we have $n$ activations and apply dropout with probability $p$, then on average $(1-p)n$ activations are left. We therefore divide the remaining activations by $1-p$, which keeps their expected scale the same as if we hadn't dropped any; so, in expectation, dropout acts like an identity function.

An implementation of dropout in PyTorch looks like this:

class Dropout(Module):
    def __init__(self, p):
        self.p = p
    
    def forward(self, x):
        # Only apply dropout during training
        if self.training:
            # Creates a mask with 1s at a probability of (1-p)
            # and 0s at a probability of p
            mask = x.new(*x.shape).bernoulli_(1 - self.p)
            # Divide the mask in place by (1-p) and multiply with x
            return x * mask.div_(1 - self.p)
        # Don't apply dropout during inference
        else:
            return x

We apply dropout before passing the outputs of our LSTM layer to the final output layer.

To change the training attribute of a PyTorch Module, you can use the train method to set it to True and the eval method to set it to False. Calling either method sets the training attribute on that Module and recursively on all of its child Modules. You won't see these calls often here since fastai's Learner class applies them automatically.
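
For example (a small sketch), the same Dropout layer behaves differently depending on that attribute:

drop = nn.Dropout(0.5)
x    = torch.ones(8)
drop.train()   # sets training=True
drop(x)        # roughly half the values zeroed, the survivors scaled to 2.
drop.eval()    # sets training=False
drop(x)        # returned unchanged: tensor([1., 1., 1., 1., 1., 1., 1., 1.])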

Activation regularization (AR) and temporal activation regularization (TAR) are essentially weight decay, but with activations. With weight decay, we add a penalty to the loss (though in practice, we add it to the gradients) to make the weights as small as possible and avoid overfitting (by making the loss surface less steep). With AR and TAR, we aim to make the final LSTM activations as small as possible.

With AR, we can do the following to the loss:

loss += alpha * activations.pow(2).mean()

But, we know from weight decay that it'll be more efficient to add them to the gradient instead of the loss:

grad += alpha * activations.mean()

Then, going straight to the gradient for TAR, we have:

grad += beta * (activations[:,1:] - activations[:,:-1]).mean()

We have two new hyperparameters we can tune for AR and TAR, alpha and beta, much like how we could adjust wd for weight decay. To apply AR and TAR, we use the RNNRegularizer callback (although that class adds the squared penalties to the loss rather than to the gradients).

But, to make AR and TAR work, we need our new model to return three things: (1) the actual output, (2) the LSTM activations pre-dropout and (3) the LSTM activations post-dropout.

We apply AR on the post-dropout LSTM activations so that we don't penalize the activations we've already dropped; and we apply TAR on the pre-dropout LSTM activations because the dropped (zeroed) activations would create big artificial differences between two consecutive time steps.
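
For illustration, here's a hedged sketch of how both penalties could be added directly to the loss, given the three outputs described above (in practice RNNRegularizer handles this for us; the function name and the alpha/beta values here are just for illustration):

def ar_tar_loss(preds, raw_out, dropped_out, targets, alpha=2., beta=1.):
    # Standard flattened cross-entropy on the actual predictions
    loss = F.cross_entropy(preds.view(-1, preds.shape[-1]), targets.view(-1))
    # AR: penalize large post-dropout activations
    loss = loss + alpha * dropped_out.float().pow(2).mean()
    # TAR: penalize big jumps between consecutive time steps,
    # computed on the pre-dropout activations
    loss = loss + beta * (raw_out[:, 1:] - raw_out[:, :-1]).float().pow(2).mean()
    return loss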

Finally, we have weight tying. Weight tying is used in language models because the input-to-hidden layer maps tokens from our vocab to activations, and the hidden-to-output layer maps activations back to tokens from that same vocab. So, we might expect these two mappings to be essentially the same (or at least, we can try to enforce that). Therefore, we can set the weights of the hidden-to-output layer to be equal to the weights of the input-to-hidden layer:

self.h_o.weight = self.i_h.weight

So, we now have our final model:

class AWDLSTM(Module):
    def __init__(self, n_vocab, n_hidden, n_layers, p):
        # What we had before in LSTM
        self.i_h  = nn.Embedding(n_vocab, n_hidden)
        self.rnn  = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.h_o  = nn.Linear(n_hidden, n_vocab)
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

        # Dropout layer
        self.drop = nn.Dropout(p)
        
        # Weight tying
        self.h_o.weight = self.i_h.weight

    def forward(self, x):
        h, state = self.rnn(self.i_h(x), self.h)
        h_drop   = self.drop(h)
        self.h   = [s.detach() for s in state]
        return self.h_o(h_drop), h, h_drop 

    def reset(self):
        for h in self.h: h.zero_()

Then, to train this model, we have:

learn = Learner(dls, AWDLSTM(len(vocab), 64, 2, 0.5), 
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])

But, since we use those callbacks so often, we can instead use TextLearner which applies ModelResetter and RNNRegularizer (with alpha=2, beta=1 as defaults):

learn = TextLearner(dls, AWDLSTM(len(vocab), 64, 2, 0.5),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

Finally, when we train our model, we can also add weight decay for additional regularization:

learn.fit_one_cycle(15, 1e-2, wd=0.1)
epoch train_loss valid_loss accuracy time
0 2.681149 2.038707 0.457031 00:01
1 1.733276 1.261145 0.619141 00:01
2 1.007044 0.856758 0.762370 00:01
3 0.523582 0.975427 0.793538 00:01
4 0.276582 0.704902 0.820557 00:01
5 0.151957 0.593644 0.867025 00:01
6 0.093755 0.536299 0.866455 00:01
7 0.061734 0.540390 0.870524 00:01
8 0.043401 0.519793 0.879313 00:01
9 0.034196 0.426708 0.882650 00:01
10 0.027788 0.494268 0.883626 00:01
11 0.023066 0.439835 0.886068 00:01
12 0.019279 0.433105 0.880697 00:01
13 0.017327 0.435248 0.886393 00:01
14 0.015784 0.425923 0.887614 00:01

And we've come a long way from 49% accuracy with a single-layer vanilla RNN.

Conclusion

So, a recurrent neural network is just a neural network that has some layers used repeatedly, such that we can put them in a loop. It's fairly difficult to get good accuracy from a vanilla RNN, and when we attempt a vanilla multilayer RNN, it becomes even harder because of exploding and vanishing gradients. That's why we now have LSTM (we could also use GRU, which makes do with a single hidden state and fewer gates). However, LSTM has an issue with overfitting. So, what do we do when we overfit? We'd normally apply data augmentation techniques (since we might not have enough data), but there aren't many cheap and quick data augmentation techniques for text. Instead, we opt for regularization techniques like dropout, activation regularization, temporal activation regularization, and weight tying. Applying these regularization techniques creates a new kind of architecture that we could call a rudimentary AWD-LSTM.

For an actual AWD-LSTM, we have to apply dropout in a few more places:

  • Embedding dropout: inside the embeddings, drop some random rows of embeddings.
  • Input dropout: applied after the embedding layer.
  • Weight dropout: applied to the weights of the LSTM at each training step.
  • Hidden dropout: applied to the hidden state between two layers.

These additional regularizations (and averaged SGD) complete AWD-LSTM, which uses 5 different kinds of dropout in total (the 5th being the one we already used: dropping activations on the output of the LSTM). fastai's implementation of AWD-LSTM, which we used in a previous blog, already has good defaults in place, and we were able to adjust the magnitude of the dropouts with the drop_mult parameter.
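
For reference, here's roughly how that looks with fastai's built-in AWD_LSTM (a sketch: dls_lm is a stand-in name for a text DataLoaders like the one from the previous blog, and drop_mult=0.3 is just an example value):

learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=accuracy)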