Adding deep learning to tabular models
Ensembling just got a whole lot more interesting.
Two blogs on the same day! I didn't want the last one to get too long, and I didn't expect to be productive enough to finish the neural network part today. But, here we go.
Previously, I trained a random forest model to predict the sale price of bulldozers from past auction data. Today, I'm going to train a deep learning model on the same data set (with some changes based on the last blog). Then, I'll take the embeddings learned by that model, use them to train another random forest, and aim for an even better validation RMSE. Finally, I'll ensemble the predictions of the deep learning model and the new random forest and see how much better they become.
So, we'll first read our data into a pandas DataFrame:
df = pd.read_csv('TrainAndValid.csv', low_memory=False)
Then, we apply the initial transforms we did last time:
- Set an order for the product sizes.
- Log the sale price.
- Split the date column into metacolumns.
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
df['SalePrice'] = np.log(df['SalePrice'])
df = add_datepart(df, 'saledate')
And we remove all the unneeded columns we determined last time:
xs_new = load_pickle('xs_new.pkl')
to_keep = list(xs_new) + ['SalePrice']
df = df[to_keep]
df.head(3)
Next, we need to know which columns should be treated as categorical so that they can be given embeddings. For that, we'll use cont_cat_split with the max cardinality set to 9,000 (as a rule of thumb, embedding sizes greater than 10,000 should only be used after you've tested whether there are better ways to group the variable).
cont, cat = cont_cat_split(df, max_card=9_000, dep_var='SalePrice')
Mainly, we want to ensure saleElapsed ends up in the continuous section: a categorical variable is turned into an embedding, and by definition an embedding can't extrapolate to values it never saw during training, which is exactly what we need for future sale dates.
cont
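If cont_cat_split had put saleElapsed on the categorical side, we could just move it over by hand. A minimal sketch:
# just in case: force saleElapsed to be treated as continuous
if 'saleElapsed' in cat:
    cat.remove('saleElapsed')
    cont.append('saleElapsed')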
Next, let's look at the cardinality of each of the categorical variables (if some have similar cardinalities, they may be redundant, and we can try removing all but one):
df[cat].nunique()
It appears that ModelID, fiModelDesc, and fiModelDescriptor may be redundant, since they all pertain to the model and have similar cardinalities. So, we'll do what we did before: get a baseline OOB score, then try dropping each of them in turn and see what happens to the results.
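As a reminder, get_oob is the helper from the last post; I'm assuming it was defined with the same hyperparameters as before, roughly:
from sklearn.ensemble import RandomForestRegressor

# assumed helper from the last post: out-of-bag score of a quick random forest
# (train_y is the log-price target from the last post)
def get_oob(df):
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              max_samples=50_000, max_features=0.5,
                              n_jobs=-1, oob_score=True)
    m.fit(df, train_y)
    return m.oob_score_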
get_oob(xs_new)
{c: get_oob(xs_new.drop(c, axis=1)) for c in
['ModelID', 'fiModelDesc', 'fiModelDescriptor']}
Overall, it seems we can remove fiModelDescriptor without significantly affecting the model.
cat.remove('fiModelDescriptor')
To create DataLoaders for our model, we can use TabularPandas again. However, for a neural network we have to add the Normalize TabularProc, since the scale of the variables matters, unlike when building a decision tree.
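One note: splits below is the same date-based train/validation split from the last post. I'm assuming it was built from the sale date (computed before the columns were trimmed), something like:
# assumed from the last post: validate on sales from October 2011 onward
cond = (df.saleYear < 2011) | (df.saleMonth < 10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))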
tp = TabularPandas(df, [Categorify, FillMissing, Normalize],
cat, cont, splits=splits, y_names='SalePrice')
# we can use a much larger batch size than we would for
# a vision model since we're dealing with
# tabular data (doesn't require as much GPU RAM)
dls = tp.dataloaders(1024)
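As a quick sanity check, we can peek at a processed batch:
dls.show_batch()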
Next, we'll see what range we should have for our predictions.
y, v_y = tp.train.y, tp.valid.y
y.min(),y.max(),v_y.min(),v_y.max()
So, we can set our y_range as (8, 12) (remember that a model tends to do better when we set the upper bound a little higher than the actual maximum):
learn = tabular_learner(dls, layers=[500, 250],
y_range=(8, 12), n_out=1,
loss_func=F.mse_loss)
As always, we'll use .lr_find() to find a good learning rate:
lr = learn.lr_find().valley
And, we'll use fit_one_cycle to train our model since we're not transfer learning:
learn.fit_one_cycle(5, lr)
preds, targs = learn.get_preds()
r_mse(preds, targs)
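(r_mse here is the same metric helper from the last post; I assume it looked roughly like this:)
import math

# assumed helper from the last post: RMSE between predictions and targets
def r_mse(pred, y):
    return round(math.sqrt(((pred - y)**2).mean()), 6)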
Overall, we got a better RMSE than the random forest model, but not by a lot.
Now, we'll try using the embeddings from the neural network to replace the categorical columns for our random forest:
# code is essentially verbatim from fast.ai's @danielwbn:
# https://forums.fast.ai/t/using-embedding-from-the-neural-network-in-random-forests/80063/10
def transfer_embeds(learn, xs):
    xs = xs.copy()
    for i, feature in enumerate(learn.dls.cat_names):
        # .cpu() since the lookup tensor below is made on the CPU while learn is on CUDA
        emb = learn.embeds[i].cpu()
        # replace the categorical column with embedding_dim columns holding its
        # learned embedding (no_grad so pandas can convert the tensor to numpy)
        with torch.no_grad():
            new_feat = pd.DataFrame(emb(tensor(xs[feature], dtype=torch.int64)),
                                    index=xs.index,
                                    columns=[f'{feature}_{j}' for j in range(emb.embedding_dim)])
        xs = xs.drop(feature, axis=1)
        xs = xs.join(new_feat)
    return xs
xs_with_embs = transfer_embeds(learn, learn.dls.train.xs)
valid_xs_with_embs = transfer_embeds(learn, learn.dls.valid.xs)
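Then we retrain a random forest on the embedding-expanded features. rf and m_rmse are the helpers from the last post, and train_y/valid_y are the targets for the same split (equivalent to tp.train.y and tp.valid.y here); roughly, I assume:
# assumed helpers from the last post
def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
        max_samples=max_samples, max_features=max_features,
        min_samples_leaf=min_samples_leaf, oob_score=True, **kwargs).fit(xs, y)

def m_rmse(m, xs, y):
    return r_mse(m.predict(xs), y)

# targets aligned with the rows we just embedded
train_y, valid_y = tp.train.y, tp.valid.y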
m = rf(xs_with_embs, train_y)
m_rmse(m, xs_with_embs, train_y), m_rmse(m, valid_xs_with_embs, valid_y)
The validation RMSE ends up worse! But, take a look at this:
We'll look at our RMSE when we ensemble the predictions from the random forest (with embeddings) and the neural network. Before we can average the two sets of predictions, though, we have to make them the same type: we need to turn our neural network predictions (a rank-2 tensor) into a rank-1 numpy array. We can apply .squeeze() to remove any unit axes (axes with a length of 1):
rf_preds = m.predict(valid_xs_with_embs)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
r_mse(ens_preds, valid_y)
And, we now have an RMSE that's lower than the top leaderboard score of 0.22909, granted we're measuring it on the validation set (since we don't have access to the SalePrice of Test.csv):
a = pd.read_csv('Test.csv', low_memory=False)
a.columns, 'SalePrice' in a.columns
So, we've finally covered tabular data training with decision trees, random forests, neural networks, and random forests with embeddings. We've ensembled the results of the last two to get a score that ("technically") beats the top leaderboard score of the Kaggle competition.
Next, we'll be moving on to natural language models. But, I might also try experimenting with another tabular data set.