Pixel Similarity vs. Basic Neural Net on the MNIST data set
Comparing two digit classifiers: a "pixel similarity" approach and a three-layer neural network.
path = untar_data(URLs.MNIST)
The dataset is split into training and testing subfolders, each of which contains a separate folder for every digit:
Path.BASE_PATH = path
path.ls(),(path/'training').ls()
We'll store the paths of the images in a list, where the ith entry contains the paths for the ith digit:
nums = [(path/'training'/f'{x}').ls().sorted() for x in range(10)]
im3_path = nums[3][0]
im3 = Image.open(im3_path)
im3
Then, we'll open the images, stack all the images of the same digit into a single tensor, and store those tensors in a list:
nums_tens = [torch.stack([tensor(Image.open(j)) for j in nums[i]]) for i in range(10)]
nums_tens = [nums_tens[i].float()/255 for i in range(10)]
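Just to confirm what we built, each element of the list is one rank-3 tensor holding every image of that digit:
# shape is (number of images of that digit, 28, 28)
nums_tens[3].shape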
We can then take the mean of one of these tensors to get that digit's "perfect" version. Here is what it looks like for a 3:
stacked_threes = nums_tens[3].mean(0)
show_image(stacked_threes)
And for comparison, here is just one of those threes:
a_3 = nums_tens[3][0]
show_image(a_3)
Next, we'll create a function that compares two tensors using the mean absolute difference:
def mnist_distance(x1, x2):
    # mean absolute difference over the last two axes (each image's rows and columns)
    return (x1 - x2).abs().mean((-1, -2))
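The mean is taken over the last two axes (each image's rows and columns), which means the function works both on a single image and, thanks to broadcasting, on a whole stack of images at once. A quick shape check to illustrate:
# one image vs. one average image -> a single scalar distance
mnist_distance(a_3, stacked_threes).shape
# a whole stack of images vs. one average image -> one distance per image (broadcasting)
mnist_distance(nums_tens[3], stacked_threes).shape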
Now we can compare one of the threes with its "perfect" version. The number doesn't mean much on its own until we compare it with the distance to another digit's average:
mnist_distance(a_3, stacked_threes)
So, we'll build the average seven, take the L1 norm (mean absolute difference) between our 3 and it, and compare that with the number we just got (0.1074):
stacked_sevens = nums_tens[7].mean(0)
show_image(stacked_sevens)
mnist_distance(a_3, stacked_sevens)
As you can see, the distance between the 3 and the average 3 is smaller than the distance between the 3 and the average 7. So, it is more three than seven. We'll extend this approach by comparing an image with the average of each digit and predicting the digit it is most similar to (the one whose average gives the smallest L1 distance).
We'll create the average number for each digit:
stacked_nums = [nums_tens[i].mean(0) for i in range(10)]
show_image(stacked_nums[4])
We can compare our 3 to each average digit:
L(mnist_distance(a_3, stacked_nums[i]) for i in range(10))
As you can see, it is most similar to the average three.
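To turn those ten distances into an actual prediction, we can just take the index of the smallest one. As a small sketch:
# the predicted digit is the index of the smallest distance (it comes out as 3 here)
torch.stack([mnist_distance(a_3, stacked_nums[i]) for i in range(10)]).argmin()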
Now we'll load the validation set and put it into a list of tensors:
valid_nums = [(path/'testing'/f'{i}').ls().sorted() for i in range(10)]
valid_nums_tens = [torch.stack([tensor(Image.open(j)) for j in valid_nums[i]]) for i in range(10)]
valid_nums_tens = [valid_nums_tens[i].float()/255 for i in range(10)]
We'll create a function that returns the accuracy of our whole process:
def is_num(x1, x2s, x):
    # Get the distance between each image and the average of each digit
    vals = [mnist_distance(x1, x2s[i]) for i in range(10)]
    # For each image, collect its 10 distances as plain floats so that we can use `min`
    vals_2 = [[vals[i][j].item() for i in range(10)] for j in range(len(x1))]
    # Get a list of bool tensors that are True when the digit with the minimum
    # distance is the digit the given image is supposed to be
    vals_3 = [tensor(vals_2[i].index(min(vals_2[i])) == x) for i in range(len(x1))]
    # Return how often our model is correct
    return tensor(vals_3).float().mean(0)
nums_accuracy = tensor([is_num(valid_nums_tens[i], stacked_nums, i) for i in range(10)])
nums_accuracy, nums_accuracy.mean(0)
Our model has an overall accuracy of 66.1%! Better than a random guess of 10%, but certainly not good. It is particularly good at recognizing a 1, but particularly bad at 2s, 5s, and 8s.
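As an aside, the same computation can be written more compactly with stacked tensor operations instead of nested Python lists. This is just an alternative sketch, not what the numbers above were produced with:
def is_num_vectorized(x1, x2s, x):
    # distances from every image to every average digit, stacked into shape (10, number of images)
    dists = torch.stack([mnist_distance(x1, x2s[i]) for i in range(10)])
    # the predicted digit for each image is the one with the smallest distance
    preds = dists.argmin(0)
    # fraction of images whose prediction matches the true digit
    return (preds == x).float().mean()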
Now that we have a baseline, we can see how good we can make a simple model built "from scratch."
For our "from scratch" learner, we'll have 2 layers, where each layer contains a linear layer and a ReLU (rectified linear unit, where all negative numbers become 0
).
For our loss function, we will be using cross-entropy loss since we have multiple categories.
First, we'll make our training and validation datasets and dataloaders. Then, we'll initialize parameters, figure out how to make predictions, calculate the loss (cross-entropy), calculate the gradients, and then step (using the provided SGD optimizer).
Let's first redownload the MNIST dataset:
path = untar_data(URLs.MNIST)
Path.BASE_PATH = path
path.ls()
Then, we'll save the training and validation images into separate variables:
nums = [(path/'training'/f'{x}').ls().sorted() for x in range(10)]
nums_tens = [torch.stack([tensor(Image.open(j)) for j in nums[i]]) for i in range(10)]
nums_tens = [nums_tens[i].float()/255 for i in range(10)]
valid_nums = [(path/'testing'/f'{i}').ls().sorted() for i in range(10)]
valid_nums_tens = [torch.stack([tensor(Image.open(j)) for j in valid_nums[i]]) for i in range(10)]
valid_nums_tens = [valid_nums_tens[i].float()/255 for i in range(10)]
Next, we'll create our dataset from our training set. A dataset is a list of tuples, each containing an independent variable and its label (the dependent variable), like so: (independent, dependent).
train_x = torch.cat(nums_tens).view(-1, 28*28)
It took me a while to realize what the .view() function was doing, but it's actually pretty simple: we give it the shape we want, and it reshapes our tensor accordingly. Here we give it -1, 28*28, which turns our rank-3 tensor (n images of 28 by 28) into a rank-2 tensor (n images of 784). The -1 means we don't have to specify how many images there are, and 28*28 means we want to flatten each 28 by 28 grid into a vector of length 784. It's like turning a 2D array into a 1D array:
# -1 makes it so that we don't have to know how many images there are
nums_tens[0].view(-1, 28*28).shape, nums_tens[0].view(5923, 28*28).shape
# before we called .view(), our tensor was originally 28x28, but afterwards, it is 28*28 (784)
nums_tens[0].size(), nums_tens[0].view(-1, 28*28).shape
We'll form our labels by repeating each digit's value as many times as there are images of that digit:
train_y = torch.cat([tensor([i] * len(nums_tens[i])) for i in range(10)])
train_x.shape, train_y.shape
# when we take a random 3, we can index into the labels at the same spot and see
# that we get 3 as its label
show_image(nums_tens[3][200]), train_y[len(nums_tens[0]) + len(nums_tens[1]) + len(nums_tens[2]) +200]
Like I said before, a dataset is just a list of tuples containing our independent and dependent variables:
dset = list(zip(train_x, train_y))
x, y = dset[0]
x.shape, y
And we can see that, given the label 0, our image is indeed a zero:
# we have to reshape our image from a 784-long vector back into a 28 by 28 matrix
show_image(x.view(28, 28))
Now we'll make the dataset for our validation set:
valid_x = torch.cat(valid_nums_tens).view(-1, 28*28)
valid_y = torch.cat([tensor([i] * len(valid_nums_tens[i])) for i in range(10)])
valid_dset = list(zip(valid_x, valid_y))
Next, we'll create a DataLoader for each of our training and validation sets. A DataLoader takes a dataset and, each time we use it, gives us a portion (a batch) of that dataset. We can then work on a batch at a time instead of just one tuple or the entire set. We can also toggle whether each batch should be randomized (we wouldn't want all the 0s, then the 1s, then the 2s, and so on; we want a mix):
dl = DataLoader(dset, batch_size = 128, shuffle = True)
valid_dl = DataLoader(valid_dset, batch_size = 128, shuffle = True)
xb, yb = first(dl)
xb.shape, yb.shape
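A quick peek at the first few labels of the batch shows the shuffling at work; the digits come out mixed rather than in order:
# thanks to shuffle=True, a single batch contains a mix of digits
yb[:10]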
Then, we'll create our DataLoaders. A DataLoaders is little more than a container: it just holds our training and validation DataLoaders:
dls = DataLoaders(dl, valid_dl)
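If I understand the class correctly, the two loaders can then be pulled back out as dls.train and dls.valid (a quick check, assuming those attribute names):
# the first loader we passed in becomes the training loader, the second the validation loader
dls.train is dl, dls.valid is valid_dl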
Our simple neural network uses PyTorch's nn.Sequential, which chains modules together so that each one's output feeds into the next:
simple_net = nn.Sequential(
    # Our first layer takes in 28*28 inputs and outputs 250
    nn.Linear(28 * 28, 250),
    nn.ReLU(),
    # Our second layer takes in 250 inputs and outputs 50
    nn.Linear(250, 50),
    nn.ReLU(),
    # Our final layer takes in 50 inputs and outputs 10
    # (its confidence for our image to be each digit)
    nn.Linear(50, 10)
)
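As a quick sanity check (a sketch using the batch we grabbed from the DataLoader earlier), passing a batch through the network gives one row of 10 scores per image:
# a batch of 128 flattened images goes in, 128 rows of 10 scores come out
preds = simple_net(xb)
xb.shape, preds.shape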
We use cross-entropy loss so that we can turn our 10 outputs into numbers between 0 and 1 that sum to 1, like probabilities (through softmax). But that's just the first part. We then take the negative log (-log(p)) of the probability assigned to the correct class, which penalizes the model heavily when that probability is low.
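To see both steps at once, here is a small check (just a sketch, and the exact numbers will vary): applying softmax and then taking the negative log of the correct class's probability gives the same value as PyTorch's F.cross_entropy.
# a small sanity check of what cross-entropy loss is doing, using one batch
acts = simple_net(xb)                                 # raw outputs, shape (128, 10)
probs = torch.softmax(acts, dim=1)                    # turn them into probabilities that sum to 1
nll = -probs[torch.arange(len(yb)), yb].log().mean()  # negative log of the correct class's probability
nll, F.cross_entropy(acts, yb)                        # the two values should match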
We'll use fastai's provided Learner class (which handles the training loop and epochs) with the SGD optimizer (stochastic gradient descent, which handles calculating the gradients and stepping the parameters toward lower loss), and use accuracy as our metric (the number we actually care about).
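Under the hood, one SGD step looks roughly like this. This is only a sketch of the idea (the Learner below does all of it for us, and the learning rate here is an arbitrary made-up value):
# a rough, hand-written version of a single training step (sketch only)
lr = 0.1
xb, yb = first(dl)
loss = F.cross_entropy(simple_net(xb), yb)  # make predictions and calculate the loss
loss.backward()                             # calculate the gradients
for p in simple_net.parameters():
    p.data -= lr * p.grad                   # step the parameters toward lower loss
    p.grad.zero_()                          # reset the gradients for the next step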
learn = Learner(dls, simple_net, opt_func = SGD, loss_func = F.cross_entropy, metrics = accuracy)
We'll use the learning rate finder to select a good learning rate for us:
lrs = learn.lr_find()
lrs
learn.fit(20, lr = lrs.valley)
And we see that our final accuracy is 97.5%! Certainly better than the 66.1% we got from our pixel similarity approach.
For comparison, here are the results using fastai's provided cnn_learner, which uses a pretrained 18-layer model (ResNet-18):
dls2 = ImageDataLoaders.from_folder(path, train='training', valid='testing')
learn2 = cnn_learner(dls2, resnet18, loss_func=F.cross_entropy, metrics=accuracy)
learn2.fit_one_cycle(1, 0.1)
Our model is not bad, considering its accuracy is within 1% of a pretrained model's.
We could even make our model better by training for more epochs (stopping once the validation loss starts getting worse) or by using a deeper model.
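For instance, a deeper variant might look something like this (just a sketch; the layer sizes are arbitrary and I haven't trained it):
# a hypothetical deeper version of simple_net (untested sketch)
deeper_net = nn.Sequential(
    nn.Linear(28 * 28, 250),
    nn.ReLU(),
    nn.Linear(250, 100),
    nn.ReLU(),
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Linear(50, 10)
)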