Predicting the Sentiment of Movie Reviews with NLP
By fine-tuning a language model on movie reviews, can we predict if they're positive or negative?
If you didn't know, I work on Colab; I don't own a GPU that I can use to train models on my own. Whenever I download data sets, I end up having to redownload them when I come back the next day. However, I found out how to download the data sets into my Google Drive so that I don't have to redownload them every time. It's also helpful since I can save the pickles to the same folder instead of saving them on my laptop and having to upload them later.
It's really simple too: you mount your Drive and then cd to your desired folder. When you download data sets, you download them into that folder. But, if you use untar_data, you have to move the files to the folder afterwards, since it only saves to ~/.fastai/...
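For completeness, the mount step itself looks something like this (drive.mount is the standard google.colab helper; /content/gdrive is just the mount point I use in the rest of this post):
from google.colab import drive
drive.mount('/content/gdrive')  # asks for authorization, then mounts your Drive at /content/gdrive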
!mkdir -p '/content/gdrive/MyDrive/data_sets/'
%cd '/content/gdrive/MyDrive/data_sets/'
There's also a difference between ! and % in Colab: ! runs a command-line command in its own temporary subshell, so it can't change the state of the notebook itself. !mkdir ... will make a directory, and !cd ... will change directories, except it doesn't really, because the change only exists inside that subshell and disappears as soon as the command finishes. So, to actually change the current directory that Colab is in, you have to use %. You can think of % as being able to change the variables stored in Colab; hypothetically, %cd path makes Colab's CURRENT_PATH = path.
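A quick way to convince yourself of this (the paths here are only for illustration):
!cd /tmp && pwd   # prints /tmp, but only inside that one subshell
!pwd              # still prints the directory Colab was in before
%cd /tmp          # the magic actually changes the notebook's working directory
!pwd              # now prints /tmp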
from fastai.text.all import *
path = untar_data(URLs.IMDB_SAMPLE)
!mv {str(path)} '/content/gdrive/MyDrive/data_sets/imdb_sample'
path = Path('/content/gdrive/MyDrive/data_sets/imdb_sample')
Since we're using the sample IMDB data set, we don't have separate folders, but instead have a single .csv file with 1000 reviews. So, we'll be using TextBlock.from_df instead of TextBlock.from_folder to tell fastai how to get our data:
df = pd.read_csv(path/'texts.csv', low_memory=False); df.head(3)
dls_lm = DataBlock(
    blocks=TextBlock.from_df('text', is_lm=True),  # tokenize the 'text' column for a language model
    get_x=ColReader('text'),                       # the tokenized column is named 'text'
    splitter=RandomSplitter(0.1)
).dataloaders(df, path=path)
When we show a batch, you can see how the independent variable starts one token before the dependent variable but stops one token earlier; the dependent variable is just the same text shifted over by one token (there's a small sketch of this right after the token list below). Fastai also has special tokens that it adds through TextBlock (which uses the Tokenizer class):
- xxbos : beginning of stream.
- xxeos : end of stream.
- xxmaj : the next token is capitalized.
- xxunk : unknown (tokens that don't appear often enough are replaced with this to minimize the size of the embedding matrix).
- xxup : the next token is all uppercase.
- xxrep followed by a number n : the next actual token repeats n times.
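Before looking at a real batch, here's a minimal sketch in plain Python (not fastai internals, just an illustration of the idea) of that one-token shift between the independent and dependent variables:
tokens = ['xxbos', 'xxmaj', 'this', 'movie', 'was', 'great']
x, y = tokens[:-1], tokens[1:]  # the dependent variable is the same text shifted by one token
x  # ['xxbos', 'xxmaj', 'this', 'movie', 'was']
y  # ['xxmaj', 'this', 'movie', 'was', 'great']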
dls_lm.show_batch(max_n=2)
In general, Tokenizer does a lot of extra things on top of tokenizing the texts:
defaults.text_proc_rules
Where:
- fix_html : replaces special HTML characters with a readable version.
- replace_rep : replaces characters repeated three or more times with xxrep n c, where xxrep is the special token, n is the number of times the character repeats, and c is the character.
- replace_wrep : replaces words repeated three or more times with xxwrep n w, where w is the word.
- spec_add_spaces : adds spaces around special characters like / and # so they get split into separate tokens.
- rm_useless_spaces : removes repeated spaces.
- replace_all_caps : lowercases a word in all caps and adds xxup before it.
- replace_maj : lowercases a capitalized word and adds xxmaj before it.
- lowercase : lowercases all words and adds xxbos to the beginning and/or xxeos to the end of the string.
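You can also run one of these rules on its own; for example, replace_rep on a string with a repeated character (the exact spacing of the output may differ slightly from what's shown in the comment):
from fastai.text.all import replace_rep
replace_rep('This movie was sooooo good')  # roughly: 'This movie was s xxrep 5 o  good'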
And here's most of it in action:
tok = WordTokenizer()
tkn = Tokenizer(tok)
list(tkn('amp; © #a" <br /> www.google.ca/INDEX.html wow wow wow'))
We'll be fine-tuning a language model pretrained on Wikipedia pages. This model is a recurrent neural network that uses the AWD-LSTM architecture.
learn = language_model_learner(dls_lm, AWD_LSTM, drop_mult=0.3, metrics=[accuracy, Perplexity()])
Since pretrained is set to True for our language_model_learner by default, all the layers except the head are frozen. So, we'll first train the head of the model for one epoch:
lr = learn.lr_find().valley
learn.fit_one_cycle(1, lr)
Then, we'll unfreeze the model and train all the layers for another epoch:
lr = learn.lr_find().valley
learn.unfreeze()
learn.fit_one_cycle(1, lr)
We'll save the model's encoder, which is the body of the model (the model not including the task-specific final layer(s)):
learn.save_encoder('lm')
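A nice side effect of passing path=path to the DataLoaders earlier: assuming fastai's default model_dir of 'models', the encoder weights should end up inside the Drive folder, so they survive between Colab sessions:
print(learn.path/learn.model_dir/'lm.pth')  # e.g. /content/gdrive/MyDrive/data_sets/imdb_sample/models/lm.pth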
If we were to stop here, we'd have a text generator (which is what a language model is). We provide a piece of text and how many words we want it to predict:
learn.predict("I liked the book better", 25, temperature=0.75)
Unlike a language model, which is self-supervised, our classifier needs labels, so we have to provide them to its DataLoaders:
dls_c = DataBlock(
    blocks=(TextBlock.from_df('text', vocab=dls_lm.vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('label'),
    splitter=ColSplitter()
).dataloaders(df, path=path)
This time, we have to specify our blocks differently so that our independent variable is a TextBlock and our dependent variable is a CategoryBlock. We also pass the language model's vocab so the classifier uses the same token-to-index mapping as the encoder, and ColSplitter splits on the is_valid column that comes with the sample .csv. So, when we show part of a batch, it'll look different from a language model's DataLoaders:
dls_c.show_batch(max_n=2)
Next, we'll define a Learner by using the text_classifier_learner function:
learn = text_classifier_learner(dls_c, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
And load the encoder:
learn = learn.load_encoder('lm')
When fine-tuning a text classifier from a pretrained language model, it's recommended to use discriminative learning rates and gradual unfreezing. So, we'll first fine-tune just the frozen model's head for 1 epoch:
learn.fit_one_cycle(1, 2e-2)
Then unfreeze the last 2 layers and begin using discriminative learning rates:
den = 2.6**4
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/den, 1e-2))
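As a rough sanity check, slice(1e-2/den, 1e-2) spreads the learning rates across the parameter groups so that the earliest unfrozen group trains about 2.6**4 ≈ 46 times slower than the head; the endpoints are easy to print:
print(1e-2/den, 1e-2)  # ≈ 0.00022 for the earliest unfrozen group, 0.01 for the head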
Then unfreeze up to the last 3 layers:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/den, 5e-3))
Next unfreeze the entire model:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/den, 1e-3))
And finally train for a few more epochs:
learn.fit_one_cycle(5, 1e-3)
And we're done! We can try predicting the sentiment of a few Morbius reviews:
[learn.predict(i) for i in [
"So many choices... just seem halfhearted. They're not terrible, but they're also not interesting.",
"Doesn't deliver a lot of high notes, but it's not unwatchable (especially if you enjoy the vampire genre). \"Morbius\" is not as great as you'd hoped, but not as bad as you feared.",
"Overall, this is an entertaining movie. It sucks you in with the action and standout scenes between some of the characters. But, while Morbius has the makings of a great anti-hero revival, it wavers with the execution.",
"Like the Venom films and unlike the MCU movies Morbius eschews the mythmaking of the Avengers for darkly comic horror.",
"Sadly, director Daniel Espinosa's action horror is itself drained of atmosphere, shocks and drama."
]]
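Each call to learn.predict returns the decoded label, the label index, and the class probabilities; assuming the review strings above were collected in a list called reviews (a hypothetical name), a small helper makes the output easier to scan:
for review in reviews:  # `reviews` is assumed to hold the Morbius review strings above
    label, idx, probs = learn.predict(review)
    print(f"{label} ({probs[idx].item():.2%}): {review[:60]}...")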
In this blog, I applied the ULMFiT approach to NLP. I fine-tuned a language model pretrained on Wikipedia pages with the smaller version of the IMDB data set to create a language model that can generate text relevant to movie reviews. Then, I used the encoder from that model to train a movie review sentiment classifier with an accuracy of 83.5%. For a model that was only trained with 1000 reviews, that's not bad!
For next steps, I'd like to train a model using the full IMDB data set, but I don't have the necessary hardware to do it in a reasonable time. When I tried training a Learner on the full data set, the estimated time for the first epoch was two and a half hours. Maybe in the future, when I pay for a GPU server or own my own setup.