Two blogs on the same day! I didn't want the last one to get too long, and I didn't think I'd be productive enough to finish the neural network part today. But, here we go.


Previously, I trained a random forest model to predict sale price from past auction data for bulldozers. Today, I'm going to train a deep learning model on the same data set (with some changes based on the last blog). Then, I'll take the embeddings learned by that model, use them to train another random forest, and aim for an even better validation RMSE. Finally, I'll ensemble the predictions of the deep learning model and the new random forest and see how much better they become.

So, we'll first read our data into a pandas DataFrame,

# assuming the same setup as the last post, e.g. `from fastai.tabular.all import *`
# (which brings in pandas as pd, numpy as np, add_datepart, TabularPandas, and friends)
df = pd.read_csv('TrainAndValid.csv', low_memory=False)

Then, we apply the initial transforms we did last time:

  • Set an order for the product sizes.
  • Log the sale price.
  • Split the date column into its date-part columns (year, month, day of week, elapsed time, and so on).
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered=True)
df['SalePrice'] = np.log(df['SalePrice'])
df = add_datepart(df, 'saledate')
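
As a quick sanity check (this wasn't in the last post), we can list the date-part columns add_datepart created from saledate; the exact set can vary slightly by fastai version:

# columns derived from 'saledate' by add_datepart
print([c for c in df.columns if c.startswith('sale')])
# typically something like: saleYear, saleMonth, saleWeek, saleDay, saleDayofweek,
# saleDayofyear, saleIs_month_end, ..., saleElapsed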

And we remove all the unneeded columns we determined last time:

to_keep_df = load_pickle('xs_new.pkl')
to_keep = list(to_keep_df) + ['SalePrice']
df = df[to_keep]
df.head(3)
YearMade ProductSize Coupler_System fiProductClassDesc ... Drive_System Hydraulics Tire_Size SalePrice
0 2004 NaN NaN Wheel Loader - 110.0 to 120.0 Horsepower ... NaN 2 Valve None or Unspecified 11.097410
1 1996 Medium NaN Wheel Loader - 150.0 to 175.0 Horsepower ... NaN 2 Valve 23.5 10.950807
2 2001 NaN None or Unspecified Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity ... NaN Auxiliary NaN 9.210340

3 rows × 15 columns

Next, we have to know which columns should be treated as categorical so that they can be given embeddings. For that, we'll use cont_cat_split with the maximum cardinality set to 9,000 (variables with more than 10,000 levels should generally only be given embeddings after you've tested whether there are better ways to group them).

cont, cat = cont_cat_split(df, max_card=9_000, dep_var='SalePrice')

Mainly, we want to ensure saleElapsed ends up in the continuous list: a categorical variable can't, by definition, extrapolate beyond the values it was trained on, and we need to predict sale prices for future dates.

cont
['saleElapsed']

Next, let's look at the cardinality of each of the categorical variables (if some have similar cardinality, they may be redundant, and we can try removing all but one):

df[cat].nunique()
YearMade                73
ProductSize              6
Coupler_System           2
fiProductClassDesc      74
ModelID               5281
fiSecondaryDesc        177
Enclosure                6
fiModelDesc           5059
ProductGroupDesc         6
fiModelDescriptor      140
Drive_System             4
Hydraulics              12
Tire_Size               17
dtype: int64

It appears that ModelID and fiModelDesc may be redundant, since they both pertain to the model and have similar cardinality (fiModelDescriptor looks related too). So, we'll do what we did before: try dropping each of them in turn and see what happens to the out-of-bag score.
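
(Here get_oob and xs_new are carried over from the last post, xs_new being the training data with the reduced column set. As a rough sketch of what get_oob did there, assuming it matches what we used before, it fits a quick random forest and returns its out-of-bag R²:)

from sklearn.ensemble import RandomForestRegressor

def get_oob(df):
    # quick forest whose OOB score lets us compare column subsets
    # without touching the validation set
    m = RandomForestRegressor(n_estimators=40, min_samples_leaf=15,
                              max_samples=50_000, max_features=0.5,
                              n_jobs=-1, oob_score=True)
    m.fit(df, train_y)  # train_y: the log sale prices (name assumed from the last post)
    return m.oob_score_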

get_oob(xs_new)
0.8760441852100622
{c: get_oob(xs_new.drop(c, axis=1)) for c in 
 ['ModelID', 'fiModelDesc', 'fiModelDescriptor']}
{'ModelID': 0.8717292042012881,
 'fiModelDesc': 0.869256405033346,
 'fiModelDescriptor': 0.8748457618690582}

Overall, it seems we can remove fiModelDescriptor without it significantly affecting the model.

cat.remove('fiModelDescriptor')

To create DataLoaders for our model, we can use TabularPandas again. However, for a neural network we have to add the Normalize TabularProc, since the scale of the variables matters, unlike when building a decision tree.
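
The splits below are the same date-based train/validation split from the last post. As a rough reminder (an assumption about its exact form; note it relies on the saleYear/saleMonth columns from add_datepart, so it has to be computed before narrowing df down to the kept columns):

# validation set = the later sales, so we're testing our ability to predict the future
cond = (df.saleYear < 2011) | (df.saleMonth < 10)
train_idx = np.where(cond)[0]
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))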

tp = TabularPandas(df, [Categorify, FillMissing, Normalize],
                   cat, cont, splits=splits, y_names='SalePrice')
# we can use a much larger batch size than we would for a vision model,
# since tabular data doesn't require as much GPU RAM
dls = tp.dataloaders(1024)

Next, we'll see what range we should have for our predictions.

y, v_y = tp.train.y, tp.valid.y
y.min(),y.max(),v_y.min(),v_y.max()
(8.465899467468262, 11.863582611083984, 8.465899467468262, 11.849397659301758)

So, we can set our y_range to (8, 12) (remember that the model tends to do better when the upper bound is a little higher than the actual maximum).
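
(Why the headroom helps: fastai implements y_range with a scaled sigmoid on the final activation, so the model can approach but never actually reach the ends of the range. A minimal sketch of what it computes:)

import torch

def scaled_sigmoid(x, lo, hi):
    # what fastai's sigmoid_range / y_range does to the final activation:
    # squashes outputs into (lo, hi), never quite hitting the endpoints
    return torch.sigmoid(x) * (hi - lo) + lo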

learn = tabular_learner(dls, layers=[500, 250], 
                        y_range=(8, 12), n_out=1,
                        loss_func=F.mse_loss)
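
Before training, we can also peek at the embedding sizes fastai chose for each categorical column (a side check, not part of the original notebook); fastai picks the width from the cardinality with a rule of thumb, which is why max_card mattered earlier:

# one embedding layer per categorical column (including the *_na columns FillMissing adds)
for name, emb in zip(learn.dls.cat_names, learn.model.embeds):
    print(name, emb.num_embeddings, emb.embedding_dim)
# fastai's default rule of thumb: emb_sz_rule(n) = min(600, round(1.6 * n**0.56))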

As always, we'll use .lr_find() to find the optimal learning rate:

lr = learn.lr_find().valley

And, we'll use fit_one_cycle to train our model since we're not transfer learning:

learn.fit_one_cycle(5, lr)
epoch train_loss valid_loss time
0 0.068500 0.075998 00:07
1 0.052604 0.063197 00:07
2 0.047209 0.059763 00:07
3 0.043221 0.061487 00:07
4 0.039401 0.053084 00:07
preds, targs = learn.get_preds()
r_mse(preds, targs)
0.230399

Overall, we got a better RMSE than the random forest model, but not by a lot.
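
(r_mse and m_rmse are the small helpers from the last post; roughly, and assuming they match what we used there, they're just the root mean squared error in log space:)

import math

def r_mse(pred, y):
    # root mean squared error (our targets are already log prices)
    return round(math.sqrt(((pred - y)**2).mean()), 6)

def m_rmse(m, xs, y):
    # RMSE of a fitted model's predictions on a given set
    return r_mse(m.predict(xs), y)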

Now, we'll try using the embeddings from the neural network to replace the categorical columns for our random forest:

# code is essentially verbatim from fast.ai's @danielwbn:
# https://forums.fast.ai/t/using-embedding-from-the-neural-network-in-random-forests/80063/10
def transfer_embeds(learn, xs):
    xs = xs.copy()
    for i, feature in enumerate(learn.dls.cat_names):
        # grab the trained embedding layer for this categorical column
        # (.cpu() since the tensor below is created on the CPU while the model is on the GPU)
        emb = learn.embeds[i].cpu()
        # replace the column's integer codes with its learned embedding vector,
        # one new column per embedding dimension (detach so pandas can convert it)
        new_feat = pd.DataFrame(emb(tensor(xs[feature], dtype=torch.int64)).detach(),
                                index=xs.index,
                                columns=[f'{feature}_{j}' for j in range(emb.embedding_dim)])
        xs = xs.drop(feature, axis=1)
        xs = xs.join(new_feat)
    return xs
xs_with_embs       = transfer_embeds(learn, learn.dls.train.xs) 
valid_xs_with_embs = transfer_embeds(learn, learn.dls.valid.xs)
m = rf(xs_with_embs, train_y)
m_rmse(m, xs_with_embs, train_y), m_rmse(m, valid_xs_with_embs, valid_y)
(0.184022, 0.236021)
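
(rf, train_y, and valid_y are also carried over from the last post; as a rough sketch, and assuming it matches what we used there, rf just wraps a scikit-learn RandomForestRegressor with the same defaults:)

from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, max_samples=200_000,
       max_features=0.5, min_samples_leaf=5, **kwargs):
    # fit a random forest with the defaults from the last post (an assumption)
    return RandomForestRegressor(n_jobs=-1, n_estimators=n_estimators,
                                 max_samples=max_samples, max_features=max_features,
                                 min_samples_leaf=min_samples_leaf,
                                 oob_score=True, **kwargs).fit(xs, y)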

The validation RMSE actually ends up worse! But take a look at this:

Now we'll look at the RMSE when we ensemble the predictions from the random forest (with embeddings) and the neural network. Before we can average the two sets of predictions, though, we have to get them into the same type: the neural network's predictions come back as a rank-2 tensor of shape (n, 1), while the random forest's are a rank-1 NumPy array. We can apply .squeeze() to drop the unit axis (an axis of length 1) and to_np to convert the tensor to a NumPy array.

rf_preds = m.predict(valid_xs_with_embs)
ens_preds = (to_np(preds.squeeze()) + rf_preds) / 2
r_mse(ens_preds, valid_y)
0.228773

And now we have an RMSE that's lower than the top leaderboard score of 0.22909; granted, we're measuring it on our own validation set, since we don't have access to the SalePrice of Test.csv:

a = pd.read_csv('Test.csv', low_memory=False)
a.columns, 'SalePrice' in a.columns
(Index(['SalesID', 'MachineID', 'ModelID', 'datasource', 'auctioneerID',
        'YearMade', 'MachineHoursCurrentMeter', 'UsageBand', 'saledate',
        'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc', 'fiModelSeries',
        'fiModelDescriptor', 'ProductSize', 'fiProductClassDesc', 'state',
        'ProductGroup', 'ProductGroupDesc', 'Drive_System', 'Enclosure',
        'Forks', 'Pad_Type', 'Ride_Control', 'Stick', 'Transmission',
        'Turbocharged', 'Blade_Extension', 'Blade_Width', 'Enclosure_Type',
        'Engine_Horsepower', 'Hydraulics', 'Pushblock', 'Ripper', 'Scarifier',
        'Tip_Control', 'Tire_Size', 'Coupler', 'Coupler_System',
        'Grouser_Tracks', 'Hydraulics_Flow', 'Track_Type',
        'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb', 'Pattern_Changer',
        'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type', 'Travel_Controls',
        'Differential_Type', 'Steering_Controls'],
       dtype='object'), False)

So, we've finally covered tabular data training with decision trees, random forests, neural networks, and random forests with embeddings. We've ensembled the results of the last two to get a score that ("technically") beats the top leaderboard score of the Kaggle competition.

Next, we'll be moving onto natural language models. But, I might also try experimenting with another tabular data set.