To practice the theory from my previous post, I'm going to be training a random forest model on a data set used by a previous Kaggle competition: "Blue Book for Bulldozers".

The contest took place 9 years ago and was sponsored by Fast Iron. Here is its description:

The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.

Fast Iron is creating a "blue book for bull dozers," for customers to value what their heavy equipment fleet is worth at auction.

Preprocessing the data set

First, we download the data set using the Kaggle API (to see exactly how I did this, open the blog in Colab).
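The core of it is roughly the following sketch (it assumes the kaggle.json API credentials file has already been uploaded to the working directory):

!pip install kaggle --quiet
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c bluebook-for-bulldozers
!unzip -q bluebook-for-bulldozers.zip

After downloading and extracting, we have these files: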

!ls
 bluebook-for-bulldozers.zip	    Test.csv	        Train.zip
'Data Dictionary.xlsx'		    tp.pkl	        Valid.7z
 kaggle.json			    Train.7z	        Valid.csv
 Machine_Appendix.csv		    TrainAndValid.7z    ValidSolution.csv
 median_benchmark.csv		    TrainAndValid.csv   Valid.zip
 random_forest_benchmark_test.csv   TrainAndValid.zip

Then, we can import the full data set into a Pandas DataFrame object:

# When low_memory = True (the default), Pandas only looks at a
# few rows to infer the type of each column. Setting
# low_memory = False makes Pandas scan the whole file instead,
# which avoids mixed-type errors as long as memory isn't an issue.
df = pd.read_csv('TrainAndValid.csv', low_memory = False)

This data set contains 53 different features:

df.columns, len(df.columns)
(Index(['SalesID', 'SalePrice', 'MachineID', 'ModelID', 'datasource',
        'auctioneerID', 'YearMade', 'MachineHoursCurrentMeter', 'UsageBand',
        'saledate', 'fiModelDesc', 'fiBaseModel', 'fiSecondaryDesc',
        'fiModelSeries', 'fiModelDescriptor', 'ProductSize',
        'fiProductClassDesc', 'state', 'ProductGroup', 'ProductGroupDesc',
        'Drive_System', 'Enclosure', 'Forks', 'Pad_Type', 'Ride_Control',
        'Stick', 'Transmission', 'Turbocharged', 'Blade_Extension',
        'Blade_Width', 'Enclosure_Type', 'Engine_Horsepower', 'Hydraulics',
        'Pushblock', 'Ripper', 'Scarifier', 'Tip_Control', 'Tire_Size',
        'Coupler', 'Coupler_System', 'Grouser_Tracks', 'Hydraulics_Flow',
        'Track_Type', 'Undercarriage_Pad_Width', 'Stick_Length', 'Thumb',
        'Pattern_Changer', 'Grouser_Type', 'Backhoe_Mounting', 'Blade_Type',
        'Travel_Controls', 'Differential_Type', 'Steering_Controls'],
       dtype='object'), 53)

According to the "Data" tab on the Kaggle competition page, the key fields in the data set are:

  • SalesID: the unique identifier of the sale.
  • MachineID: the unique identifier of a machine. A machine can be sold multiple times.
  • SalePrice: what the machine sold for at auction (only provided in train.csv).
  • saledate: the date of the sale.

After we read the data, we should handle the orderable categorical variables (i.e., ordinal variables). We can do so by explicitly setting an order and telling Pandas to treat the column as ordered. The only feature this applies to is the ProductSize column:

df['ProductSize'].unique()
array([nan, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact'], dtype=object)

Currently, its categories have no meaningful order, but we can give them one:

sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered = True)
df['ProductSize'].unique()
[NaN, 'Medium', 'Small', 'Large / Medium', 'Mini', 'Large', 'Compact']
Categories (6, object): ['Large' < 'Large / Medium' < 'Medium' < 'Small' < 'Mini' < 'Compact']

Next, we have to use the correct metric for the competition. For regression problems we usually use RMSE, but Kaggle competitions state which metric to use, and this one uses RMSLE (root mean squared log error), which is just RMSE computed on the logs of the predictions and targets. So, as a little preprocessing, we take the log of the SalePrice column; that way we can simply use RMSE during training (and work with smaller target values as a bonus):

df['SalePrice'] = np.log(df['SalePrice'])

RMSLE is better suited than RMSE when you care about relative (percentage) errors rather than absolute ones, which is usually the case for prices. It also penalizes underestimates more heavily than overestimates. To learn more about RMSLE, read this blog post.
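To make the equivalence concrete, here's a small sketch with toy numbers (it uses a plain log rather than the log(1 + x) of the strict RMSLE definition, which barely matters for prices in the thousands):

# RMSLE on raw prices equals RMSE on logged prices
def rmsle(preds, y): return np.sqrt(((np.log(preds) - np.log(y)) ** 2).mean())
def rmse(preds, y):  return np.sqrt(((preds - y) ** 2).mean())

raw_preds, raw_y = np.array([9_000., 60_000.]), np.array([10_000., 50_000.])
rmsle(raw_preds, raw_y), rmse(np.log(raw_preds), np.log(raw_y))   # both ≈ 0.149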

We should also handle dates. Currently, the dates are just... the dates.

df['saledate'].head()
0    11/16/2006 0:00
1     3/26/2004 0:00
2     2/26/2004 0:00
3     5/19/2011 0:00
4     7/23/2009 0:00
Name: saledate, dtype: object

Dates are pretty special since some dates are more important than others: there are holidays, Fridays, Mondays, weekends, ends of quarters, ends of years, etc. So, we use fastai's add_datepart function, which splits the date column into metadata columns.

df = add_datepart(df, 'saledate')

Evidently, fastai has generated many metadata columns, and we now have 13 date-related features:

L([i for i in df if i.startswith('sale')]), len(df.columns)
((#13) ['saleYear','saleMonth','saleWeek','saleDay','saleDayofweek','saleDayofyear','saleIs_month_end','saleIs_month_start','saleIs_quarter_end','saleIs_quarter_start'...],
 65)

Next, we need to transform our data such that it has no strings or missing values. To do so, we use fastai's TabularPandas by passing in the DataFrame, TabularProcs, which features are categorical and which are continuous, what the dependent variable is, and how we're splitting the data.

A TabularProc is a special kind of Transform that performs the transform when it's called instead of lazily as it's accessed, and returns the exact same object after modifying it in place.

Categorify assigns a numeric code to each level of a categorical column; the codes themselves carry no meaning (the model learns the meanings on its own). If we defined an order beforehand, like we did with ProductSize, Categorify uses that order instead of the default alphabetical one.

FillMissing replaces missing values in continuous columns with the median of that column and adds a Boolean column recording which values were missing.
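As a rough illustration of what these two procs do (plain pandas on a toy DataFrame, not the actual fastai implementation):

# Conceptual sketch of Categorify + FillMissing
toy = pd.DataFrame({'color': ['red', 'blue', None], 'hours': [10.0, None, 30.0]})

# Categorify: map each level to an integer code (0 is reserved for '#na#')
toy['color'] = toy['color'].astype('category').cat.codes + 1

# FillMissing: record which values were missing, then fill them with the column median
toy['hours_na'] = toy['hours'].isna()
toy['hours']    = toy['hours'].fillna(toy['hours'].median())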

procs = [Categorify, FillMissing]

We can use fastai's cont_cat_split, which returns a tuple: (continuous column names, categorical column names). Passing dep_var = 'SalePrice' tells cont_cat_split to ignore that column. The 1 is the maximum cardinality: any numeric column with more than 1 unique value is treated as continuous.

cont, cat = cont_cat_split(df, 1, dep_var = 'SalePrice')
L(cont), L(cat)
((#14) ['SalesID','MachineID','ModelID','datasource','auctioneerID','YearMade','MachineHoursCurrentMeter','saleYear','saleMonth','saleWeek'...],
 (#50) ['UsageBand','fiModelDesc','fiBaseModel','fiSecondaryDesc','fiModelSeries','fiModelDescriptor','ProductSize','fiProductClassDesc','state','ProductGroup'...])

To split our data into training and validation sets, we want to construct them so that our model can generalize to unseen data. Since this is a time series data set, we want our training set to occur before our validation set. So, we can split it according to time: TrainAndValid.csv goes up to April 2012, and Test.csv covers the six months starting May 2012, so we'll split our data such that the training set ends six months before April 2012, i.e., before November 2011:

# Forms a Boolean Series that is True where the condition
# holds. The pipe '|' acts as an element-wise OR, so the
# condition is: the sale year is before 2011, or the sale
# month is before November
cond = (df['saleYear'] < 2011) | (df['saleMonth'] < 11)

# The first element of the tuple produced by
# np.where contains the indices, i, where cond[i]
# was True
train_idx = np.where(cond)[0]

# Applying ~ to cond turns all True values to
# False and False to True
valid_idx = np.where(~cond)[0]

splits = (list(train_idx), list(valid_idx))

Now, we can put all of them inside of a TabularPandas:

tp = TabularPandas(df, procs, cat, cont, y_names = 'SalePrice', splits = splits)

When we display the data, it doesn't look any different:

tp.show(3)
UsageBand fiModelDesc fiBaseModel fiSecondaryDesc fiModelSeries fiModelDescriptor ProductSize fiProductClassDesc state ProductGroup ProductGroupDesc Drive_System Enclosure Forks Pad_Type Ride_Control Stick Transmission Turbocharged Blade_Extension Blade_Width Enclosure_Type Engine_Horsepower Hydraulics Pushblock Ripper Scarifier Tip_Control Tire_Size Coupler Coupler_System Grouser_Tracks Hydraulics_Flow Track_Type Undercarriage_Pad_Width Stick_Length Thumb Pattern_Changer Grouser_Type Backhoe_Mounting Blade_Type Travel_Controls Differential_Type Steering_Controls saleIs_month_end saleIs_month_start saleIs_quarter_end saleIs_quarter_start saleIs_year_end saleIs_year_start auctioneerID_na MachineHoursCurrentMeter_na SalesID MachineID ModelID datasource auctioneerID YearMade MachineHoursCurrentMeter saleYear saleMonth saleWeek saleDay saleDayofweek saleDayofyear saleElapsed SalePrice
0 Low 521D 521 D #na# #na# #na# Wheel Loader - 110.0 to 120.0 Horsepower Alabama WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# None or Unspecified None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139246 999089 3157 121 3.0 2004 68.0 2006 11 46 16 3 320 1.163635e+09 11.097410
1 Low 950FII 950 F II #na# Medium Wheel Loader - 150.0 to 175.0 Horsepower North Carolina WL Wheel Loader #na# EROPS w AC None or Unspecified #na# None or Unspecified #na# #na# #na# #na# #na# #na# #na# 2 Valve #na# #na# #na# #na# 23.5 None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# Standard Conventional False False False False False False False False 1139248 117657 77 121 3.0 1996 4640.0 2004 3 13 26 4 86 1.080259e+09 10.950807
2 High 226 226 #na# #na# #na# #na# Skid Steer Loader - 1351.0 to 1601.0 Lb Operating Capacity New York SSL Skid Steer Loaders #na# OROPS None or Unspecified #na# #na# #na# #na# #na# #na# #na# #na# #na# Auxiliary #na# #na# #na# #na# #na# None or Unspecified None or Unspecified None or Unspecified Standard #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# #na# False False False False False False False False 1139249 434808 7009 121 3.0 2001 2838.0 2004 2 9 26 3 57 1.077754e+09 9.210340

But when we display them as items (what our model is actually going to see), all the categorical variables have been replaced with numeric codes:

tp.items.head(3)
SalesID SalePrice MachineID ModelID ... saleIs_year_start saleElapsed auctioneerID_na MachineHoursCurrentMeter_na
0 1139246 11.097410 999089 3157 ... 1 1.163635e+09 1 1
1 1139248 10.950807 117657 77 ... 1 1.080259e+09 1 1
2 1139249 9.210340 434808 7009 ... 1 1.077754e+09 1 1

3 rows × 67 columns

And, since we defined an order for ProductSize, that order is maintained:

tp.classes['ProductSize']
['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact']

But features whose order we didn't define are ordered alphabetically:

tp.classes['state']
['#na#', 'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Puerto Rico', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Unspecified', 'Utah', 'Vermont', 'Virginia', 'Washington', 'Washington DC', 'West Virginia', 'Wisconsin', 'Wyoming']
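Since this preprocessing takes a little while, it's worth saving the processed TabularPandas to disk so we don't have to redo it; this is presumably where the tp.pkl file in the directory listing above comes from. A minimal sketch using fastcore's save_pickle / load_pickle helpers:

# Save the processed TabularPandas for later, and load it back when needed
save_pickle('tp.pkl', tp)
tp = load_pickle('tp.pkl')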

Now that we've processed our data, we can start training a random forest model!

Modeling

First, we'll get the split data from our TabularPandas object:

train_xs, train_y = tp.train.xs, tp.train.y
valid_xs, valid_y = tp.valid.xs, tp.valid.y

Then, we'll make a decision tree model as a baseline:

dt = DecisionTreeRegressor(min_samples_leaf = 25)
dt.fit(train_xs, train_y)
DecisionTreeRegressor(min_samples_leaf=25)

To see how well our model did, we can take the RMSE of our predictions, since we took the log beforehand (so our models predict the log of the sale price, which makes RMSE equivalent to RMSLE on the raw prices). We'll define these as functions so that we don't have to repeat them later:

def r_mse(preds, y):
    # RMSE between (already logged) predictions and targets
    return round(math.sqrt(((preds - y)**2).mean()), 6)

def m_rmse(m, xs, y):
    # RMSE of a model's predictions on the given independent variables
    return r_mse(m.predict(xs), y)
m_rmse(dt, train_xs, train_y), m_rmse(dt, valid_xs, valid_y)
(0.211429, 0.266451)

So, we should be trying to train a model that has a better RMSE than 0.266.

We'll train our random forest model as an ensemble of 40 decision trees:

def rf(xs, y, n_estimators = 40, max_samples = 200_000, 
       max_features = 0.5, min_samples_leaf = 5, **kwargs):
    return RandomForestRegressor(
        n_jobs       = -1,                   # use all CPU cores
        n_estimators = n_estimators,         # number of decision trees
        max_samples  = max_samples,          # max number of rows sampled for each tree
        max_features = max_features,         # fraction of features considered at each split
        min_samples_leaf = min_samples_leaf, # each leaf must have at least this many samples (to prevent overfitting)
        oob_score    = True                  # track the out-of-bag (OOB) score
    ).fit(xs, y)
m = rf(train_xs, train_y)
m_rmse(m, train_xs, train_y), m_rmse(m, valid_xs, valid_y)
(0.171233, 0.23292)

And, our out-of-bag error is:

r_mse(m.oob_prediction_, train_y)
0.210877

Since the out-of-bag error is lower than our validation error, ordinary overfitting can't fully explain the gap; something else (most likely the time difference between the training and validation periods) is hurting validation performance.

Interpretation

First, we'll look at how confident the forest is in each prediction by measuring how much the individual trees agree:

preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_])
preds.shape # we should have 40 trees
(40, 5754)
preds_std = preds.std(0)
preds_std.sort()
preds_std
array([0.04779576, 0.05120998, 0.0513565 , ..., 0.5755972 , 0.57621563, 0.57734788])

The standard deviation of the per-auction predictions ranges from about 0.048 to 0.577: for some auctions the trees agree (low standard deviation), while for others they disagree (high standard deviation). In production, you could warn the user to be more wary of a prediction whose standard deviation is above a certain threshold.
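A minimal sketch of that kind of check (the 0.25 threshold is arbitrary):

# Flag auctions where the trees disagree a lot. We recompute the stds because
# preds_std was sorted in place above and no longer lines up with the rows.
per_row_std = preds.std(0)
uncertain   = per_row_std > 0.25
uncertain.sum(), len(uncertain)   # how many predictions we'd warn the user about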

To see which features were the most important, we can use the .feature_importances_ attribute of our RandomForestRegressor model to get each column's feature importance score. We can pair these up with the column names in a DataFrame and sort it to see which features were the most important.

def feature_importance(df, m):
    return pd.DataFrame({'Feature': df.columns, 'Importance': m.feature_importances_}
                       ).sort_values('Importance', ascending = False)
fi = feature_importance(train_xs, m)
fi[:10]
Feature Importance
57 YearMade 0.184860
6 ProductSize 0.114785
30 Coupler_System 0.114240
7 fiProductClassDesc 0.078662
54 ModelID 0.058189
65 saleElapsed 0.050378
3 fiSecondaryDesc 0.038825
12 Enclosure 0.038811
32 Hydraulics_Flow 0.036170
1 fiModelDesc 0.032337
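To see these at a glance, a horizontal bar chart works well (a small sketch, assuming matplotlib.pyplot is imported as plt):

# Bar chart of the 30 most important features (largest at the top)
fi[:30].sort_values('Importance').plot('Feature', 'Importance', kind = 'barh',
                                       figsize = (12, 7), legend = False)
plt.show()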

We'll try removing the unimportant features and see if it affects the model's performance. First, we set a threshold (0.005) and only take the columns whose importance is greater than that threshold:

imp = fi[fi.Importance > 0.005].Feature
len(imp)
21

Then, we make new training and validation sets:

train_xs_imp = train_xs[imp]
valid_xs_imp = valid_xs[imp]
train_xs_imp
YearMade ProductSize Coupler_System fiProductClassDesc ... Drive_System MachineID Hydraulics Tire_Size
0 2004 0 0 59 ... 0 999089 1 17
1 1996 3 0 62 ... 0 117657 1 12
2 2001 0 1 39 ... 0 434808 4 0
3 2001 4 0 8 ... 0 1026470 1 0
4 2007 0 1 40 ... 0 1057373 4 0
... ... ... ... ... ... ... ... ... ...
412693 2005 5 0 13 ... 0 1919201 12 0
412694 2005 5 0 17 ... 0 1882122 4 0
412695 2005 5 0 13 ... 0 1944213 4 0
412696 2006 5 0 13 ... 0 1794518 4 0
412697 2006 5 0 17 ... 0 1944743 4 0

406944 rows × 21 columns

Finally, we train a new model using the new training set:

m_imp = rf(train_xs_imp, train_y)
m_rmse(m_imp, train_xs_imp, train_y), m_rmse(m_imp, valid_xs_imp, valid_y)
(0.181282, 0.233111)

Although our training metric became a bit worse, our validation metric isn't far off, so we can train future models using this new training set, having effectively removed roughly two-thirds of the columns as unnecessary.

Next, we'll remove the redundant columns. We'll use cluster_columns, which determines feature similarity through rank correlation.

cluster_columns(train_xs_imp)
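This draws a dendrogram of the columns. Roughly speaking, cluster_columns computes Spearman rank correlations between the columns and then hierarchically clusters them; a hand-rolled sketch (not the exact fastai implementation, and assuming np and plt are available as elsewhere in this notebook) looks something like this:

import scipy.cluster.hierarchy as hc
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

# Pairwise Spearman rank correlations (rounded so the matrix is exactly symmetric)
corr = np.round(spearmanr(train_xs_imp).correlation, 4)

# Hierarchically cluster the columns on the distance 1 - correlation and draw a dendrogram
z = hc.linkage(squareform(1 - corr), method = 'average')
hc.dendrogram(z, labels = list(train_xs_imp.columns), orientation = 'left')
plt.show()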

We'll see how removing each redundant feature affects the model by defining a function that trains a quick random forest model and returns its out-of-bag score:

def get_oob(df):
    m = RandomForestRegressor(
        n_estimators = 40, 
        min_samples_leaf = 15, # a higher value gives shallower trees
        max_samples = 50_000,
        max_features = 0.5,
        n_jobs = -1,
        oob_score = True).fit(df, train_y)
    return m.oob_score_

The oob_score_, unlike oob_prediction_, returns $R^2$, so a perfect model scores 1.0 while a model no better than always predicting the mean scores 0.0. As a baseline, we'll first find the oob_score_ of our current training set:

get_oob(train_xs_imp)
0.8771422789129072
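For reference, a minimal sketch of how that $R^2$ is computed, using the out-of-bag predictions of the random forest m we trained earlier (it was fit with oob_score = True):

# R^2 = 1 - SS_res / SS_tot: 1.0 means perfect predictions,
# 0.0 means no better than always predicting the mean
ss_res = ((train_y - m.oob_prediction_) ** 2).sum()
ss_tot = ((train_y - train_y.mean()) ** 2).sum()
1 - ss_res / ss_tot   # should match m.oob_score_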

Then, we'll try removing each possibly redundant feature one by one:

{c: get_oob(train_xs_imp.drop(labels = c, axis = 'columns'))
            for c in ('saleYear', 'saleElapsed', 'Grouser_Tracks', 
                      'Hydraulics_Flow', 'Coupler_System', 
                      'ProductGroup', 'ProductGroupDesc',
                      'fiBaseModel', 'fiModelDesc')}
{'Coupler_System': 0.8777286429358461,
 'Grouser_Tracks': 0.8781106723829636,
 'Hydraulics_Flow': 0.8778805337963219,
 'ProductGroup': 0.87719310108094,
 'ProductGroupDesc': 0.877706764282738,
 'fiBaseModel': 0.8755285591209638,
 'fiModelDesc': 0.8760491649430575,
 'saleElapsed': 0.8726949857217733,
 'saleYear': 0.8762485106840037}

Since the out-of-bag score didn't change much when dropping any single column, we'll try keeping just one column from each redundant group:

to_drop = ['saleYear', 'Grouser_Tracks', 'Hydraulics_Flow', 'ProductGroup', 'fiBaseModel']
get_oob(train_xs_imp.drop(to_drop, axis = 1))
0.8754873568883906

The score ultimately only goes down by about 0.002 from our baseline, so we'll drop these columns from our training and validation sets.

train_xs_fin = train_xs_imp.drop(to_drop, axis = 1)
valid_xs_fin = valid_xs_imp.drop(to_drop, axis = 1)

While we're at it, we'll also clip the years so that the minimum YearMade isn't 1000 (presumably a placeholder for unknown years) but 1950:

train_xs_fin.loc[train_xs_fin.YearMade < 1950, 'YearMade'] = 1950
valid_xs_fin.loc[valid_xs_fin.YearMade < 1950, 'YearMade'] = 1950

Now, we'll check that our RMSE didn't increase significantly:

m = rf(train_xs_fin, train_y)
m_rmse(m, train_xs_fin, train_y), m_rmse(m, valid_xs_fin, valid_y)
(0.182645, 0.234452)

The validation RMSE barely changed, and now we only have 16 columns instead of 66.

len(train_xs_fin.columns)
16

With most of the unnecessary features removed, we can revisit the most important features and look at their partial dependence plots, which show how our predictions depend on each feature:

from sklearn.inspection import PartialDependenceDisplay

fig, ax = plt.subplots(figsize = (18, 4))
PartialDependenceDisplay.from_estimator(m, train_xs_fin, 
    ['YearMade', 'Coupler_System', 'ProductSize'], grid_resolution = 20, ax = ax)
<sklearn.inspection._plot.partial_dependence.PartialDependenceDisplay at 0x7fa97cbd3150>

From these plots, it appears that price has an exponential relationship with year, which makes sense (since products tend to depreciate exponentially over time). But for the other two, the missing data seems to be quite important:

tp.classes['Coupler_System'], tp.classes['ProductSize']
(['#na#', 'None or Unspecified', 'Yes'],
 ['#na#', 'Large', 'Large / Medium', 'Medium', 'Small', 'Mini', 'Compact'])

Coupler_System, although it's one of the most important features, makes its most important distinction based on whether the field was filled in at all... and with ProductSize, although it makes sense that the price decreases as the size decreases, the "no size mentioned" level also plays a significant role.

If we were the ones who created the data set, it would be interesting to see why that might be the case. Maybe only data entered after a certain date contained these values, or maybe these columns aren't as important.

Next, we can look at how the features influence an individual prediction. To do so, we can use the treeinterpreter library to compute each feature's contribution to the prediction, and then use the waterfallcharts library to draw it.

For the treeinterpreter, we pass the model and the rows of data for which we want predictions. Then, it returns a tuple of three items:

  • prediction is the predicted value for the dependent variable.
  • bias is the mean of the dependent variable.
  • contributions is a list of contributions made by each feature.

So, prediction = bias + contributions.sum(). And, we'll plot the contributions using a waterfall chart.

nrows = 1
row = 0
prediction, bias, contributions = treeinterpreter.predict(m, valid_xs_fin[:nrows].values)
prediction[row], bias[row], contributions[row].sum()
(array([10.64886316]), 10.104243648162724, 0.5446195099995459)
waterfall(valid_xs_fin.columns, contributions[row], 
          rotation_value = 90, formatting = '{:,.2f}', net_label = 'Total')
<module 'matplotlib.pyplot' from '/usr/local/lib/python3.7/dist-packages/matplotlib/pyplot.py'>

So, it seems that YearMade, ModelID, fiSecondaryDesc, and fiModelDesc were the most important features for this row's prediction.

But it's a bit odd that ModelID is an important feature: what happens when a new ModelID numbering scheme is introduced in the future? To check for this kind of out-of-domain data, we can combine our training and validation sets and train a model to predict whether a row comes from the training or the validation set.

df_domain = pd.concat([train_xs_fin, valid_xs_fin])
is_valid  = np.array([0] * len(train_xs_fin) + [1] * len(valid_xs_fin))
m         = rf(df_domain, is_valid)
feature_importance(df_domain, m)[:6]
Feature Importance
5 saleElapsed 0.919558
9 SalesID 0.061309
13 MachineID 0.014026
0 YearMade 0.000824
4 ModelID 0.000767
11 fiModelDescriptor 0.000733

So, the main things that differ between the training and validation sets are saleElapsed, SalesID, and MachineID. Like before, let's try removing each of them and see how our model changes.

m = rf(train_xs_fin, train_y)
m_rmse(m, valid_xs_fin, valid_y)
0.23369
{c: m_rmse(rf(train_xs_fin.drop(c, axis = 1), train_y), 
           valid_xs_fin.drop(c, axis = 1),
           valid_y)
 for c in ['saleElapsed', 'SalesID', 'MachineID', 'YearMade', 'ModelID']}
{'MachineID': 0.233734,
 'ModelID': 0.238536,
 'SalesID': 0.231391,
 'YearMade': 0.268748,
 'saleElapsed': 0.237849}

So, it appears that we can remove MachineID and SalesID without worry (the model actually improved when we removed SalesID).

to_drop = ['MachineID', 'SalesID']
xs_new       = train_xs_fin.drop(to_drop, axis = 1)
valid_xs_new = valid_xs_fin.drop(to_drop, axis = 1)

And we'll double check our model's RMSE didn't go up significantly:

m = rf(xs_new, train_y)
m_rmse(m, valid_xs_new, valid_y)
0.230939

Not only did it not go up, it actually went down. So, removing MachineID and SalesID, which probably correlated with YearMade and saleElapsed, made the model better.

For the next blog, I'll train a deep learning model using the information gained through training and analyzing a random forest model. Then, I'll use the embeddings learned by the neural network to retrain a random forest model and try to beat the #1 gold-medal-level model on the leaderboards.