Using auction data to predict sale price
You mean you can predict my worth?
To practice the theory from my previous post, I'm going to be training a random forest model on a data set used by a previous Kaggle competition: "Blue Book for Bulldozers".
The contest took place 9 years ago and was sponsored by Fast Iron. Here is its description:
The goal of the contest is to predict the sale price of a particular piece of heavy equipment at auction based on its usage, equipment type, and configuration. The data is sourced from auction result postings and includes information on usage and equipment configurations.
Fast Iron is creating a "blue book for bull dozers," for customers to value what their heavy equipment fleet is worth at auction.
First, we download the data set using the Kaggle API (to see how I did this, open the blog in Colab). Once downloaded, we have these files:
!ls
Then, we can import the full data set into a Pandas DataFrame object:
# When low_memory = True (the default), Pandas only looks at
# a few rows to infer the type of the data in each column. If
# we want to minimize type errors and aren't running into
# memory issues, we should set low_memory = False.
df = pd.read_csv('TrainAndValid.csv', low_memory = False)
This data set contains 53 different features:
df.columns, len(df.columns)
According to the "Data" tab on the Kaggle competition page, the key fields in the data set are:
SalesID: the unique identifier of the sale.
MachineID: the unique identifier of a machine. A machine can be sold multiple times.
SalePrice: what the machine sold for at auction (only provided in train.csv).
saledate: the date of the sale.
After we read the data, we should handle the orderable categorical variables (called ordinal variables). We can do so by explicitly defining an order and telling Pandas to treat the column as an ordered categorical. The only feature this applies to is the ProductSize column:
df['ProductSize'].unique()
Currently, its levels are in no particular order, but we can give them one:
sizes = 'Large','Large / Medium','Medium','Small','Mini','Compact'
df['ProductSize'] = df['ProductSize'].astype('category')
df['ProductSize'] = df['ProductSize'].cat.set_categories(sizes, ordered = True)
df['ProductSize'].unique()
Next, we have to use the correct metric for the competition. For regression problems we usually use RMSE, but Kaggle competitions usually state which metric to use. For this competition it's RMSLE (root mean squared log error), which is just RMSE computed after taking the log of the values. So, as a little preprocessing, we take the log of the SalePrice column (then we can simply use RMSE later and work with smaller target values while training):
df['SalePrice'] = np.log(df['SalePrice'])
RMSLE is preferable to RMSE when you care about relative (percentage) errors rather than absolute ones, and it penalizes underestimates more heavily than overestimates. To learn more about RMSLE, read this blog post.
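To make that concrete, here's a minimal sketch (with made-up prices) showing that taking the RMSE of log-transformed values is the same as taking the RMSLE of the raw values (ignoring the +1 that some RMSLE definitions add inside the log):
import numpy as np

def rmse(preds, targs):
    return np.sqrt(((preds - targs) ** 2).mean())

def rmsle(preds, targs):
    # log first, then RMSE; some definitions use log(1 + x) instead
    return np.sqrt(((np.log(preds) - np.log(targs)) ** 2).mean())

preds = np.array([ 9500., 31000., 58000.])  # hypothetical predictions
targs = np.array([10000., 30000., 60000.])  # hypothetical sale prices
assert np.isclose(rmsle(preds, targs), rmse(np.log(preds), np.log(targs)))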
We should also handle dates. Currently, the dates are just... the dates.
df['saledate'].head()
Dates are pretty special since some dates are more important than others: there are holidays, Fridays, Mondays, weekends, end of quarter, end of year, etc. So, we use fastai's add_datepart function, which splits the date column into metadata columns.
df = add_datepart(df, 'saledate')
Evidently, fastai has generated many metadata columns and we now have 13 date-related features:
L([i for i in df if i.startswith('sale')]), len(df.columns)
Next, we need to transform our data so that it has no strings or missing values. To do so, we use fastai's TabularPandas, passing in the DataFrame, the TabularProcs, which features are categorical and which are continuous, what the dependent variable is, and how we're splitting the data.
A TabularProc is a special kind of Transform that performs the transform once, when it's called, instead of lazily as the data is accessed, and returns the exact same object after modifying it in place.
Categorify maps each level of a categorical column to a number, effectively making the column numeric, where the assigned numbers carry no meaning in themselves (the model learns the meanings on its own). If we defined an order beforehand, like we did with ProductSize, it uses that order instead of assigning an arbitrary one.
FillMissing replaces missing values in continuous columns with the median of that column (and keeps track of which rows were originally missing).
procs = [Categorify, FillMissing]
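Roughly speaking, these two procs do something like the following in plain pandas (a simplified sketch on a toy DataFrame, not fastai's actual implementation):
import pandas as pd

toy = pd.DataFrame({'UsageBand': ['Low', 'High', None, 'Medium'],
                    'MachineHours': [120.0, None, 3500.0, 80.0]})

# Categorify: replace each level of a categorical column with an integer code
# (shifted by one so that 0 can stand for missing, mirroring fastai's convention)
toy['UsageBand'] = toy['UsageBand'].astype('category').cat.codes + 1

# FillMissing: fill missing continuous values with the column median and
# keep a boolean column marking which rows were originally missing
toy['MachineHours_na'] = toy['MachineHours'].isna()
toy['MachineHours'] = toy['MachineHours'].fillna(toy['MachineHours'].median())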
We can use fastai's cont_cat_split, which returns a tuple: (continuous labels, categorical labels). By passing dep_var = 'SalePrice', we tell cont_cat_split to ignore that column. The 1 is the maximum cardinality: any numeric column with more than 1 unique value will be treated as continuous.
cont, cat = cont_cat_split(df, 1, dep_var = 'SalePrice')
L(cont), L(cat)
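The rule it applies is roughly the following (a simplified sketch with a hypothetical helper, not fastai's actual code):
def my_cont_cat_split(df, max_card = 1, dep_var = None):
    cont, cat = [], []
    for col in df.columns:
        if col == dep_var:
            continue  # the dependent variable is ignored
        # numeric columns with more unique values than max_card are continuous;
        # everything else is treated as categorical
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].nunique() > max_card:
            cont.append(col)
        else:
            cat.append(col)
    return cont, cat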
To split our data into training and validation sets, we want to construct them so that they measure whether our model generalizes to unseen data. Since this is a time series data set, we want our training set to come before our validation set in time. TrainAndValid.csv goes up to April 2012, and Test.csv covers the six months from May 2012 onward, so we'll mimic that by holding out roughly the last six months (from November 2011 on) as validation and training on what comes before:
# A boolean Series with True where the condition holds.
# The pipe '|' acts as an element-wise 'or': a row goes into
# the training set if the sale happened before 2011 or in a
# month before November
cond = (df['saleYear'] < 2011) | (df['saleMonth'] < 11)
# The first element of the tuple produced by np.where
# contains the indices i where cond[i] was True
train_idx = np.where(cond)[0]
# Applying ~ to cond flips every True to False and vice versa
valid_idx = np.where(~cond)[0]
splits = (list(train_idx), list(valid_idx))
Now, we can put all of them inside a TabularPandas:
tp = TabularPandas(df, procs, cat, cont, y_names = 'SalePrice', splits = splits)
When we display the data, it doesn't look any different:
tp.show(3)
But when we display them as items (what our model is actually going to see), all the categorical variables have been replaced with numbers:
tp.items.head(3)
And, since we defined an order for ProductSize, that order is maintained:
tp.classes['ProductSize']
But for a feature whose order we didn't define, the levels are ordered alphabetically:
tp.classes['state']
Now that we've processed our data, we can start training a random forest model!
First, we'll get the split data from our TabularPandas object:
train_xs, train_y = tp.train.xs, tp.train.y
valid_xs, valid_y = tp.valid.xs, tp.valid.y
Then, we'll fit a decision tree model as a baseline:
dt = DecisionTreeRegressor(min_samples_leaf = 25)
dt.fit(train_xs, train_y)
To see how well our model did, we can take the RMSE of our predictions; since we took the log of SalePrice beforehand, our models predict the log of the sale price, so this is effectively the competition's RMSLE. We'll define the functions once so we don't have to repeat them later:
import math

def r_mse(preds, y):
    return round(math.sqrt(((preds - y)**2).mean()), 6)

def m_rmse(m, xs, y):
    return r_mse(m.predict(xs), y)
m_rmse(dt, train_xs, train_y), m_rmse(dt, valid_xs, valid_y)
So, we should be trying to train a model that has a better RMSE than 0.266.
We'll train our random forest model as an ensemble of 40 decision trees:
def rf(xs, y, n_estimators = 40, max_samples = 200_000,
       max_features = 0.5, min_samples_leaf = 5, **kwargs):
    return RandomForestRegressor(
        n_jobs = -1,                         # use all CPU cores
        n_estimators = n_estimators,         # number of decision trees
        max_samples = max_samples,           # rows sampled for each tree
        max_features = max_features,         # fraction of columns considered at each split
        min_samples_leaf = min_samples_leaf, # each leaf must contain at least this many samples (to limit overfitting)
        oob_score = True                     # track the out-of-bag error
    ).fit(xs, y)
m = rf(train_xs, train_y)
m_rmse(m, train_xs, train_y), m_rmse(m, valid_xs, valid_y)
And, our out-of-bag error is:
r_mse(m.oob_prediction_, train_y)
Since the out-of-bag error is lower than our validation error, overfitting alone can't explain the validation error; something else is contributing to it.
First, we'll look at how confident the forest is in each prediction, by comparing the predictions of the individual trees:
preds = np.stack([t.predict(valid_xs.values) for t in m.estimators_])
preds.shape # we should have 40 trees
preds_std = preds.std(0)
preds_std.sort()
preds_std
The standard deviation of the predictions ranges from 0.00478 to 0.57735: for some auctions the trees agree (low standard deviation), while for others they disagree (high standard deviation). In production, you could warn the user to be more wary of a prediction whose standard deviation is above a certain threshold.
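For example, a production wrapper could look something like this (the function name and the 0.3 threshold are made up for illustration):
def predict_with_warning(forest, xs, std_threshold = 0.3):
    # predictions of every individual tree: shape (n_trees, n_rows)
    per_tree = np.stack([t.predict(xs) for t in forest.estimators_])
    preds = per_tree.mean(0)                # the forest's prediction is the mean over trees
    warn = per_tree.std(0) > std_threshold  # True where the trees disagree a lot
    return preds, warn

preds, low_confidence = predict_with_warning(m, valid_xs.values)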
To see which features were the most important, we can use the .feature_importances_ attribute of our RandomForestRegressor model to get each column's feature importance score. We can pair the scores up with the column names in a DataFrame and sort it to see which features were the most important.
def feature_importance(df, m):
    fi = pd.DataFrame({'Feature': df.columns, 'Importance': m.feature_importances_})
    return fi.sort_values('Importance', ascending = False)
fi = feature_importance(train_xs, m)
fi[:10]
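It's often easier to scan the importances as a plot; for example, a horizontal bar chart of the top 30 features using pandas' built-in plotting:
fi[:30].plot(x = 'Feature', y = 'Importance', kind = 'barh', figsize = (12, 7), legend = False)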
We'll try removing the unimportant features and see if it affects the model's performance. First, we set a threshold (0.005) and only take the columns whose importance is greater than that threshold:
imp = fi[fi.Importance > 0.005].Feature
len(imp)
Then, we make new training and validation sets:
train_xs_imp = train_xs[imp]
valid_xs_imp = valid_xs[imp]
train_xs_imp
Finally, we train a new model using the new training set:
m_imp = rf(train_xs_imp, train_y)
m_rmse(m_imp, train_xs_imp, train_y), m_rmse(m_imp, valid_xs_imp, valid_y)
Although our training metric became a bit worse, our validation metric isn't far off, so we can train future models using this new training set, since it effectively removes about two-thirds of the columns as unnecessary.
Next, we'll remove the redundant columns. We'll use cluster_columns, which determines feature similarity through rank correlation.
cluster_columns(train_xs_imp)
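Under the hood, cluster_columns presumably follows the usual recipe of computing Spearman rank correlations between the columns and hierarchically clustering them; a rough sketch of that idea (not necessarily fastai's exact implementation):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

corr = np.round(spearmanr(train_xs_imp).correlation, 4)  # rank correlation between columns
dist = squareform(1 - corr)                              # turn similarity into a condensed distance matrix
link = hierarchy.linkage(dist, method = 'average')
hierarchy.dendrogram(link, labels = list(train_xs_imp.columns), orientation = 'left')
plt.show()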
We'll see how removing each redundant feature affects the model by defining a function that trains a quick random forest and returns its out-of-bag score:
def get_oob(df):
    m = RandomForestRegressor(
        n_estimators = 40,
        min_samples_leaf = 15,  # higher value gives shallower trees
        max_samples = 50_000,
        max_features = 0.5,
        n_jobs = -1,
        oob_score = True).fit(df, train_y)
    return m.oob_score_
The oob_score_, unlike oob_prediction_, returns $R^2$, so a perfect model has a score of 1.0 while a random model has 0.0. As a baseline, we'll first find the oob_score_ of our current training set:
get_oob(train_xs_imp)
Then, we'll try removing each possibly redundant feature one by one:
{c: get_oob(train_xs_imp.drop(labels = c, axis = 'columns'))
 for c in ('saleYear', 'saleElapsed', 'Grouser_Tracks',
           'Hydraulics_Flow', 'Coupler_System',
           'ProductGroup', 'ProductGroupDesc',
           'fiBaseModel', 'fiModelDesc')}
Since the out-of-bag score didn't change much, we'll try keeping just one feature from each redundant group:
to_drop = ['saleYear', 'Grouser_Tracks', 'Hydraulics_Flow', 'ProductGroup', 'fiBaseModel']
get_oob(train_xs_imp.drop(to_drop, axis = 1))
The score ultimately only goes down by 0.002 from our baseline, so we'll drop these columns from our training and validation sets.
train_xs_fin = train_xs_imp.drop(to_drop, axis = 1)
valid_xs_fin = valid_xs_imp.drop(to_drop, axis = 1)
While we're at it, we'll also adjust YearMade so that the minimum year isn't 1000 (clearly a placeholder) but 1950:
train_xs_fin.loc[train_xs_fin.YearMade < 1950, 'YearMade'] = 1950
valid_xs_fin.loc[valid_xs_fin.YearMade < 1950, 'YearMade'] = 1950
Now, we'll check that our RMSE didn't increase significantly:
m = rf(train_xs_fin, train_y)
m_rmse(m, train_xs_fin, train_y), m_rmse(m, valid_xs_fin, valid_y)
The validation RMSE barely changed, and now we only have 16 columns instead of 66:
len(train_xs_fin.columns)
With most of the unnecessary features removed, we can revisit the most important features and look at their partial dependence plots, which show how our predictions depend on each of them:
from sklearn.inspection import PartialDependenceDisplay
fig, ax = plt.subplots(figsize = (18, 4))
PartialDependenceDisplay.from_estimator(m, train_xs_fin,
    ['YearMade', 'Coupler_System', 'ProductSize'],
    grid_resolution = 20, ax = ax)
From these plots, it appears that price has an exponential relationship with YearMade (remember we're predicting the log of the price, so a roughly linear trend here means an exponential one in the raw price), which makes sense since equipment tends to depreciate exponentially over time. But for the other two features, the missing data seems to be quite important:
tp.classes['Coupler_System'], tp.classes['ProductSize']
Coupler_System, although it's the second most important feature, seems to matter mostly through whether the value was filled in at all; and with ProductSize, although it makes sense that the price decreases as the size decreases, the level representing no size mentioned also plays a significant role.
If we were the ones who created the data set, it would be interesting to investigate why that might be the case. Maybe only data entered after a certain date contained these values, or maybe these columns aren't as important as they appear.
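One quick way to check the first hypothesis would be to go back to the raw CSV and see how often ProductSize is missing in each sale year (a sketch; it re-reads the file since our DataFrame has already been transformed):
df_raw = pd.read_csv('TrainAndValid.csv', low_memory = False)
sale_year = pd.to_datetime(df_raw['saledate']).dt.year
# fraction of rows with ProductSize missing, per sale year
df_raw.groupby(sale_year)['ProductSize'].apply(lambda s: s.isna().mean())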
Next, we can look at how the features influence an individual prediction. To do so, we use the treeinterpreter library to compute each feature's contribution to the prediction, and the waterfallcharts library to draw it.
For treeinterpreter, we pass the model and the rows of data for which we want predictions. It returns a tuple of three items:
prediction: the predicted value for the dependent variable.
bias: the mean of the dependent variable.
contributions: the contribution of each feature to the prediction.
So, prediction = bias + contributions.sum(). And, we'll plot the contributions using a waterfall chart.
nrows = 1
row = 0
prediction, bias, contributions = treeinterpreter.predict(m, valid_xs_fin[:nrows].values)
prediction[row], bias[row], contributions[row].sum()
waterfall(valid_xs_fin.columns, contributions[row],
          rotation_value = 90, formatting = '{:,.2f}', net_label = 'Total')
So, it seems that YearMade, ModelID, fiSecondaryDesc, and fiModelDesc were the most important features for this row's prediction.
But it's a bit odd that ModelID is an important feature. What would happen if a new numbering scheme for ModelID were introduced in the future? So, we should look for out-of-domain data by combining our training and validation sets and training a model to predict whether a row comes from the training or the validation set.
df_domain = pd.concat([train_xs_fin, valid_xs_fin])
is_valid = np.array([0] * len(train_xs_fin) + [1] * len(valid_xs_fin))
m = rf(df_domain, is_valid)
feature_importance(df_domain, m)[:6]
So, the main things that appear to change between the training and validation sets are saleElapsed, SalesID, and MachineID. Like before, let's try removing each of them and see how our model changes.
m = rf(train_xs_fin, train_y)
m_rmse(m, valid_xs_fin, valid_y)
{c: m_rmse(rf(train_xs_fin.drop(c, axis = 1), train_y),
           valid_xs_fin.drop(c, axis = 1), valid_y)
 for c in ['saleElapsed', 'SalesID', 'MachineID', 'YearMade', 'ModelID']}
So, it appears that we can remove MachineID and SalesID without worry (the model actually improved when we removed SalesID).
to_drop = ['MachineID', 'SalesID']
xs_new = train_xs_fin.drop(to_drop, axis = 1)
valid_xs_new = valid_xs_fin.drop(to_drop, axis = 1)
And we'll double check our model's RMSE didn't go up significantly:
m = rf(xs_new, train_y)
m_rmse(m, valid_xs_new, valid_y)
Not only did it not go up, it actually went down. So, removing MachineID and SalesID, which are probably correlated with YearMade and saleElapsed, made the model better.
For the next blog, I'll train a deep learning model using the information gained through training and analyzing a random forest model. Then, I'll use the embeddings learned by the neural network to retrain a random forest model and try to beat the #1 gold-medal-level model on the leaderboards.