How to train your NLP model (ULMFiT)

In NLP (natural language processing), we can use transfer learning to train an NLP model on top of a pretrained language model. By starting from a pretrained language model, you need less time, less data, and less money. And, unlike in computer vision, the pretrained model doesn't need to have been trained on data similar to your task's.

A language model is a model trained to predict the next word given a sequence of text. We train these models through self-supervised learning, where we don't provide any labels, just a lot of text. The model creates the labels automatically from the text itself (hence "self-supervised") and trains on them, developing an understanding of the text's language.

So, you have a pretrained language model, but it isn't the best idea to train your NLP model on it right away. The pretrained model probably doesn't know the vocabulary and style (grammar, formality, etc.) of your problem domain. Instead, you first fine-tune the language model on the corpus of your problem domain, and then fine-tune that new model into your NLP model.

With this method, we can fine-tune the language model on the text from both the training and validation (and maybe even test) sets, which makes the new language model very good at predicting the next word in your problem domain. This isn't cheating, because fine-tuning the language model uses only the text, never the labels.

This process of taking a language model, fine-tuning it on your corpus, and then fine-tuning that into an NLP model is called the Universal Language Model Fine-tuning (ULMFiT) approach.
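Here's a minimal sketch of those three stages using the fastai library (which ships this workflow, including the pretrained AWD-LSTM used below). The IMDb movie-review dataset stands in for your own corpus, and the epoch counts and learning rates are only illustrative:

```python
from fastai.text.all import *

# Stage 1 is already done for us: fastai ships an AWD-LSTM language
# model pretrained on a large Wikipedia corpus (WikiText-103).
path = untar_data(URLs.IMDB)  # stand-in corpus; use your own domain's text

# Stage 2: fine-tune the pretrained language model on our corpus.
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)
learn_lm = language_model_learner(dls_lm, AWD_LSTM)
learn_lm.fine_tune(1, 2e-2)
learn_lm.save_encoder('finetuned_encoder')  # keep everything but the head

# Stage 3: fine-tune that language model into a text classifier.
dls_clas = TextDataLoaders.from_folder(path, valid='test',
                                       text_vocab=dls_lm.vocab)
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned_encoder')
learn_clas.fine_tune(4, 1e-2)  # a shortcut; a more careful schedule is shown later
```

The rest of this post walks through what these steps are actually doing.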

Training your NLP (classifier) model

In training an NLP model, we first have to fine-tune our language model. To do so, we need to preprocess the text such that it's ready to be put into a model.

Preprocessing the text

Text is a categorical variable. To put it inside a neural network, we have to map it through an embedding matrix. This mapping is the first layer of the neural network: it turns a categorical variable into continuous values through an embedding lookup.

With regular categorical variables, we take these steps (sketched in code after the list):

  1. Make a list of all possible levels of that categorical variable (which we call the vocab).
  2. Replace each level with its index in the vocab.
  3. Make an embedding matrix for this categorical variable where each row corresponds to a level.
  4. Use this embedding matrix as the first layer of a neural network. This layer takes the index $i$ created in step 2 and returns the $i$-th row in the matrix.
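As a quick PyTorch sketch (the categorical variable and the embedding size here are made up):

```python
import torch
import torch.nn as nn

colors = ["red", "green", "blue", "green"]     # a toy categorical variable

vocab = sorted(set(colors))                    # step 1: ["blue", "green", "red"]
indices = [vocab.index(c) for c in colors]     # step 2: [2, 1, 0, 1]

emb = nn.Embedding(num_embeddings=len(vocab), embedding_dim=4)  # step 3
vectors = emb(torch.tensor(indices))           # step 4: row i for each index i
print(vectors.shape)                           # torch.Size([4, 4])
```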

For text, the first step is a little different, since there are many ways to define the levels of text, and it works differently for different languages. This process is called tokenization, and each item in the vocab is called a token. The second step, where we assign a number to each token, is called numericalization.

When we use a pretrained model, our new vocab will contain words that were in the pretrained model's vocab, but also some that weren't. For words the pretrained model already knows, we keep the corresponding rows of its embedding matrix; for new words, we initialize their rows with random vectors.
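A sketch of that row-copying, with made-up vocabs and a made-up embedding size:

```python
import torch

old_vocab = ["the", "movie", "was"]            # pretrained model's vocab
old_emb = torch.randn(len(old_vocab), 400)     # stand-in for its embeddings
new_vocab = ["the", "film", "was", "great"]    # vocab built from our corpus

new_emb = torch.randn(len(new_vocab), 400) * 0.1   # random init everywhere...
old_idx = {w: i for i, w in enumerate(old_vocab)}
for i, w in enumerate(new_vocab):
    if w in old_idx:                           # ...but keep pretrained rows
        new_emb[i] = old_emb[old_idx[w]]       # for words the old model knows
```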

In tokenizing a given corpus, there are three main methods (each illustrated in the sketch after the list):

  1. Words: Split a sentence at every space and at punctuation (apostrophes, for example, split contractions apart) to create word tokens.
  2. Subwords: Split words into subwords based on the most commonly occurring substrings.
  3. Characters: Split a sentence into characters.
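Roughly, the three look like this (the word split is a deliberately crude regex, and the subword split is made up; a real subword tokenizer learns its splits from the corpus):

```python
import re

sentence = "It isn't tokenized yet."

# 1. Words: split on spaces and punctuation.
words = re.findall(r"\w+|[^\w\s]", sentence)
# ['It', 'isn', "'", 't', 'tokenized', 'yet', '.']

# 2. Subwords: split into commonly occurring substrings (invented here;
#    the _ marks where a word starts).
subwords = ["_It", "_is", "n't", "_token", "ized", "_yet", "."]

# 3. Characters: every character becomes a token.
chars = list(sentence)
# ['I', 't', ' ', 'i', 's', 'n', ...]
```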

When do we use which? Word tokenizers assume that spaces are meaningful separators in a sentence. While this is usually true for English, languages like Chinese and Japanese that don't really use spaces are better served by subword or character tokenizers. And when spaces are meaningful but the language builds long words out of many smaller pieces, as in Turkish and Hungarian, subword tokenizers tend to work better than word tokenizers. Lastly, when a language has many characters (unlike the 26 letters of English), like Chinese, character tokenizers may be the better choice.

You want to be careful not to have too many items in your vocab. With subword tokenization, the vocab size is a trade-off: a larger subword vocab means fewer tokens per sentence, which means faster training, less memory, and less state for the model to remember. But in general, a larger vocab leads to larger embedding matrices, which require more data to learn, take longer to train, and use more GPU memory.

Once we have our vocab, we can convert every token in the corpus into a number that represents its index in the vocab.
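A bare-bones sketch of numericalization (real pipelines also cap the vocab size and reserve special tokens, for example for unknown or padding words):

```python
from collections import Counter

tokens = ["the", "movie", "was", "great", ",", "the", "plot", "was", "fun"]

# Vocab: every distinct token, most common first.
vocab = [tok for tok, _ in Counter(tokens).most_common()]
token_to_idx = {tok: i for i, tok in enumerate(vocab)}

nums = [token_to_idx[tok] for tok in tokens]
# [0, 2, 1, 3, 4, 0, 5, 1, 6] -- each number is the token's index in the vocab
```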

Then, we have to make our independent and dependent variables for our Dataset object (which is just a wrapper class for a tuple of (independent, dependent)).

For a language model, we want it to predict the next word in a sequence of words. So, given a sequence, we want our independent variable to be the sequence from its first word to its second-to-last word. Our dependent variable is then the sequence from its second word to its last word.
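In code, this is just an offset of one token:

```python
tokens = ["the", "movie", "was", "really", "great"]

x = tokens[:-1]   # independent: ["the", "movie", "was", "really"]
y = tokens[1:]    # dependent:   ["movie", "was", "really", "great"]
# At each position i, the model reads x up to i and learns to predict y[i],
# which is the next word.
```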

We also divide the text into small pieces while maintaining their order (otherwise our model would just predict random words instead of the next word in the sequence).

At every epoch, we shuffle our collection of documents and concatenate them into a stream of tokens. Then, we cut that stream into a batch of fixed-size consecutive mini-streams. The model then reads the mini-streams in order and learns to predict the next word for each independent variable.

Unlike with images, the key thing in NLP is that we shuffle the documents (blocks of text) but always maintain the order of the words within each document.
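A rough sketch of that batching with a made-up token stream (real data loaders also handle the leftover tokens at the end):

```python
import torch

stream = torch.arange(60)   # pretend: 60 token indices from shuffled documents
bs, seq_len = 4, 5          # 4 mini-streams, 5 tokens per training sequence

n = (len(stream) // bs) * bs
batches = stream[:n].view(bs, -1)   # one row per mini-stream, order preserved

for i in range((batches.shape[1] - 1) // seq_len):
    x = batches[:, i*seq_len     : (i+1)*seq_len]       # inputs
    y = batches[:, i*seq_len + 1 : (i+1)*seq_len + 1]   # targets, shifted by one
```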

Fine-tuning the language model

When we fine-tune the pretrained language model, we use a recurrent neural network (rather than the convolutional neural networks used for vision), specifically the AWD-LSTM architecture. For our loss function, we use cross-entropy loss, since (almost all) NLP problems are classification problems where the different categories are the words in the vocab. Finally, for our metrics we'll use accuracy (since cross-entropy is difficult to interpret and speaks more to the model's confidence than its correctness) and perplexity (the exponential of the loss).
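To make that relationship concrete, here's a small PyTorch sketch with random stand-in predictions:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(10, 100)            # 10 predictions over a 100-word vocab
targets = torch.randint(0, 100, (10,))   # the "correct next word" for each

loss = F.cross_entropy(logits, targets)  # the training loss
perplexity = torch.exp(loss)             # what the perplexity metric reports
```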

If we don't want to train a text classifier and instead just want a language model, we can stop here and end up with a text generator. If you add some randomness (so you don't get the same prediction twice), you can generate many different kinds of text given the first few words.
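Continuing the pipeline sketch from the top (where `learn_lm` was the fine-tuned language model), fastai's `predict` for language models takes a `temperature` argument that supplies this randomness; the prompt below is made up:

```python
# Lower temperature = safer, more repetitive text; higher = more random.
for _ in range(2):
    print(learn_lm.predict("I liked this movie because",
                           n_words=40, temperature=0.75))
# Each call samples different words, so the two continuations differ.
```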

Otherwise, you'll use this fine-tuned language model to train a text classifier.

Fine-tuning the text classifier

Unlike with a language model, when making our DataLoaders, our independent variable will be the text, while our dependent variable (the labels) must be supplied. And, since the tensors in a mini-batch have to be the same shape, we sort the texts by token length before each epoch and, within every mini-batch, pad each text to the token length of the longest text in that mini-batch. By "pad", we mean adding a special padding token that'll be ignored by the model.
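For example, with PyTorch's `pad_sequence` (the token ids and the padding index here are made up):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 1   # hypothetical vocab index of the special padding token

batch = [torch.tensor([4, 9, 2]),          # three numericalized texts
         torch.tensor([7, 3, 8, 5, 2]),    # of different lengths
         torch.tensor([6, 2])]

padded = pad_sequence(batch, batch_first=True, padding_value=PAD_IDX)
# tensor([[4, 9, 2, 1, 1],
#         [7, 3, 8, 5, 2],
#         [6, 2, 1, 1, 1]])   <- every row now has the longest row's length
```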

Then, we turn our fine-tuned language model into the classifier by training it with discriminative learning rates and gradual unfreezing.
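Continuing the pipeline sketch from the top (where `dls_clas` and the saved encoder were built), here's a sketch of that schedule; the epoch counts, learning rates, and the `2.6**4` divisor follow the ULMFiT recipe but should be treated as starting points:

```python
learn = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn.load_encoder('finetuned_encoder')

learn.fit_one_cycle(1, 2e-2)        # everything frozen but the new head

learn.freeze_to(-2)                 # unfreeze the last two layer groups
learn.fit_one_cycle(1, slice(1e-2 / (2.6**4), 1e-2))   # discriminative LRs:
                                    # earlier layers get smaller rates
learn.freeze_to(-3)                 # unfreeze one more group
learn.fit_one_cycle(1, slice(5e-3 / (2.6**4), 5e-3))

learn.unfreeze()                    # finally, train the whole model
learn.fit_one_cycle(2, slice(1e-3 / (2.6**4), 1e-3))
```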

In the end, you have a language model that can generate text related to your problem domain and a text classifier that can classify text in your problem domain with certain labels.

Conclusion

With NLP, there's a lot of fine-tuning. With ULMFiT, you start with a language model pretrained on a really big dataset like Wikipedia, you fine-tune it on your own text to get a language model that generates text really well for your problem domain, and then you fine-tune that language model into a text classifier.