Just The Basics - Strata 2013 | predicting spam emails
This was actually the easiest tabular data set I could find...
Hello! I wanted to move on to natural language processing by next week, but I also wanted some more practice with tabular data sets. And what better way than to use a data set that's related to natural language: emails.
This Kaggle competition ran 9 years ago and is one of the "getting started" data sets. This data set contains "100 features extracted from a corpus of emails. Some of the emails are spam and some are normal. [Our] task is to make a spam detector."
I've already downloaded the data set, so we can start by reading in the files:
import pandas as pd

# Read the training features, training labels, and test features
xs = pd.read_csv('train.csv', low_memory=False)
y = pd.read_csv('train_labels.csv', low_memory=False)
test = pd.read_csv('test.csv', low_memory=False)
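As a quick sanity check (the exact shapes depend on the downloaded files, but the features should line up with the 100 columns described on the competition page), we can peek at what we just loaded:

# Quick look at the data; shapes depend on the downloaded CSVs
print(xs.shape, y.shape, test.shape)
xs.head()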
Then, we'll define functions for our random forest trainer and the metric that this competition requires, which is the area under the ROC curve.
from sklearn.ensemble import RandomForestRegressor
from sklearn import metrics

def rf(xs, y, n_estimators=40, min_samples_leaf=5, max_samples=300,
       max_features=0.5):
    # Train a random forest regressor with out-of-bag scoring enabled
    return RandomForestRegressor(n_estimators=n_estimators,
                                 min_samples_leaf=min_samples_leaf,
                                 max_samples=max_samples,
                                 max_features=max_features,
                                 oob_score=True, n_jobs=-1).fit(xs, y)

def a_uc(preds, y):
    # Area under the ROC curve, the competition metric
    fpr, tpr, thresholds = metrics.roc_curve(y, preds, pos_label=1)
    return metrics.auc(fpr, tpr)

def m_auc(m, xs, y):
    # Convenience wrapper: AUC of a fitted model's predictions on (xs, y)
    return a_uc(m.predict(xs), y)
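To make the metric a bit more concrete, here's a tiny made-up example (the numbers below are purely illustrative): ROC AUC only cares about how well the scores rank positives above negatives, so a perfect ranking gets 1.0 even if the scores aren't calibrated probabilities.

import numpy as np

# Toy illustration: both spam rows score higher than both non-spam rows, so AUC = 1.0
toy_preds = np.array([0.2, 0.9, 0.7, 0.1])
toy_y = np.array([0, 1, 1, 0])
a_uc(toy_preds, toy_y)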
In making our TabularPandas, we'll merge the independent and dependent variables. But unlike before, we'll be using a randomized split since this isn't a time series.
from fastai.tabular.all import *

# Merge the features and labels; the label column is named '0'
df_merged = pd.concat((xs, y), axis=1).copy()
procs = [Categorify, FillMissing, Normalize]
cont, cat = cont_cat_split(df_merged, dep_var='0')
tp = TabularPandas(df_merged, procs, cat_names=cat, cont_names=cont, y_names='0',
                   splits=RandomSplitter()(xs))
t_xs, v_xs, t_y, v_y = tp.train.xs, tp.valid.xs, tp.train.y, tp.valid.y
I'm not sure how to use TabularPandas that well since there are no categorical columns in this data set, but it seems like TabularPandas requires them. If there aren't any, it ends up duplicating all the columns, so we remove the duplicates here:
# Drop the duplicated columns (the first 100 entries in x_names)
t_xs = t_xs.drop(tp.train.x_names[0:100], axis=1)
v_xs = v_xs.drop(tp.valid.x_names[0:100], axis=1)
Then, we train a decision tree as a baseline:
from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(min_samples_leaf=40).fit(t_xs, t_y)
m_auc(dt, t_xs, t_y), m_auc(dt, v_xs, v_y)
Surprisingly, the baseline is already at 0.912. Note that ROC AUC isn't the same as accuracy, so this doesn't mean the model is 91.2% accurate; it measures how well the model ranks spam above non-spam, so I'll just refer to the score as 0.912.
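If we did want a rough accuracy number to compare against, one option (not something to read too much into, since the 0.5 threshold below is an arbitrary choice) is to turn the tree's continuous outputs into hard labels and score those:

from sklearn.metrics import accuracy_score

# Rough sketch: hard 0/1 labels at an arbitrary 0.5 threshold, scored with plain accuracy
hard_preds = (dt.predict(v_xs) >= 0.5).astype(int)
accuracy_score(v_y, hard_preds)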
Next, let's train a random forest model:
m = rf(t_xs, t_y, min_samples_leaf=10, n_estimators=120)
m_auc(m, t_xs, t_y), m_auc(m, v_xs, v_y)
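Since the rf helper sets oob_score=True, we could also look at an out-of-bag estimate of the same metric; oob_prediction_ below is the standard scikit-learn attribute holding each training row's averaged prediction from the trees that didn't see it:

# Out-of-bag AUC: each training row is scored only by trees that never trained on it
a_uc(m.oob_prediction_, t_y)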
And then, we'll train a neural network:
dls = tp.dataloaders(128)
Again, we're dropping the duplicated columns from TabularPandas:
dls.train.xs = dls.train.xs.drop(columns=dls.train.x_names[0:100])
dls.valid.xs = dls.valid.xs.drop(columns=dls.valid.x_names[0:100])
learn = tabular_learner(dls, y_range=(0, 1.1), n_out=1)
lr = learn.lr_find().valley
learn.fit_one_cycle(20, lr)
preds, targs = learn.get_preds()
a_uc(preds, targs)
Finally, we ensemble the predictions from our baseline (since it's not far off from the neural network's ROC AUC), random forest model, and neural network:
rf_preds = m.predict(v_xs)
ens_preds = (to_np(preds.squeeze()) + rf_preds + dt.predict(v_xs)) / 3
a_uc(ens_preds, v_y)
With a fairly quick set of models and none of the more "complex" components (the kind I'd probably learn in the second part of fast.ai), we're able to come 36th on the leaderboard (which is within the top 75%).
Anyway, this post is just a quick recap of tabular model training using a relatively small data set. We didn't need to preprocess the data since TabularPandas handles the missing values for us. We trained a baseline, then trained a random forest and a neural network model, then ensembled the predictions. Perhaps I could have analyzed the columns and removed some of the unimportant features. Unlike last time, there aren't any categorical columns, so I didn't use the embeddings to train another random forest model.
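If I were to prune features, a natural starting point would be the random forest's impurity-based importances; here's a minimal sketch of what that could look like (not something I actually ran for this post):

# Rank columns by the random forest's feature importances
fi = pd.DataFrame({'cols': t_xs.columns, 'imp': m.feature_importances_})
fi.sort_values('imp', ascending=False).head(10)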
Maybe in the future I'll revisit this data set and aim to get a higher score.