Hello! I wanted to move on to natural language processing by next week, but I also wanted some more practice with tabular data sets. And what better way than to use a data set that's related to natural language: emails.

This Kaggle competition ran 9 years ago and is one of the "getting started" data sets. This data set contains "100 features extracted from a corpus of emails. Some of the emails are spam and some are normal. [Our] task is to make a spam detector."

I've already downloaded the data set, so we can import pandas and read in the files:

import pandas as pd

xs   = pd.read_csv('train.csv', low_memory=False)
y    = pd.read_csv('train_labels.csv', low_memory=False)
test = pd.read_csv('test.csv', low_memory=False)
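Before going further, let's take a quick peek at what we loaded (a small check I'm adding here; I didn't note down the exact output):

# Quick look at the data: the competition description says there are
# 100 extracted features, plus a separate file with the labels.
xs.shape, y.shape, test.shape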

Then, we'll define functions for our random forest trainer and for the metric that this competition requires, which is the area under the ROC curve (AUC).

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, min_samples_leaf=5, max_samples=300,
       max_features=0.5):
    return RandomForestRegressor(n_estimators=n_estimators,
                                 min_samples_leaf=min_samples_leaf,
                                 max_samples=max_samples,
                                 max_features=max_features,
                                 oob_score=True, n_jobs=-1).fit(xs, y)

def a_uc(preds, y):
    # area under the ROC curve, from the false/true positive rates
    fpr, tpr, thresholds = metrics.roc_curve(y, preds, pos_label=1)
    return metrics.auc(fpr, tpr)

def m_auc(m, xs, y):
    # AUC of a fitted model's predictions on the given data
    return a_uc(m.predict(xs), y)
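As a quick sanity check of a_uc (my addition here, not part of the original run), a perfect ranking, where every spam email scores above every normal one, should give an AUC of exactly 1.0:

# Hypothetical toy example: positives (label 1) scored 0.9 and 0.8,
# negatives (label 0) scored 0.1 and 0.2, so the ranking is perfect.
a_uc([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0])  # -> 1.0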

In making our TabularPandas, we'll merge the independent and dependent variables. But unlike before, we'll use a randomized split, since this isn't a time series.

from fastai.tabular.all import *

df_merged = pd.concat((xs, y), axis=1).copy()
procs = [Categorify, FillMissing, Normalize]
cont, cat = cont_cat_split(df_merged, dep_var='0')
tp = TabularPandas(df_merged, procs, cat_names=cat, cont_names=cont,
                   y_names='0', splits=RandomSplitter()(xs))
t_xs, v_xs, t_y, v_y = tp.train.xs, tp.valid.xs, tp.train.y, tp.valid.y
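One aside: RandomSplitter shuffles differently on every run. If I wanted the validation scores below to be reproducible, I could pass a seed (hypothetical here; the original run didn't):

# Hypothetical: pin the shuffle so the 80/20 train/valid split is
# identical across runs (the seed value is arbitrary).
splits = RandomSplitter(valid_pct=0.2, seed=42)(xs)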

I'm not sure how to use TabularPandas that well since there are no categorical columns in this data set, but it seems like TabularPandas requires them. If there aren't any, it duplicates all the columns, so we remove the duplicates here:

# the first 100 x_names are the duplicated copies, so drop them
t_xs = t_xs.drop(tp.train.x_names[0:100], axis=1)
v_xs = v_xs.drop(tp.valid.x_names[0:100], axis=1)
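To make sure the drop did what we expect (a check I'm adding here, with no recorded output), we can confirm that the train and valid feature matrices are back down to the original 100 columns:

# Hypothetical check: both splits should now have 100 feature columns
t_xs.shape, v_xs.shape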

Then, we train a decision tree as a baseline:

from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(min_samples_leaf=40).fit(t_xs, t_y)
m_auc(dt, t_xs, t_y), m_auc(dt, v_xs, v_y)
(0.9117870857001292, 0.9124478856462179)

Surprisingly, the baseline is already at 0.912. Note that ROC AUC isn't equivalent to accuracy: it measures how well the model ranks spam above normal emails, not the fraction of correct predictions. So rather than saying the model is 91.2% accurate, I'll just refer to the score as 0.912.
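If I did want an accuracy figure to compare against (hypothetical, assuming the labels are encoded as 0 and 1, which the y_range used further below suggests), I could threshold the regressor's scores at 0.5:

# Hypothetical: turn continuous scores into hard 0/1 predictions by
# thresholding at 0.5, then take the fraction that match the labels.
acc = ((dt.predict(v_xs) > 0.5).astype(int) == v_y).mean()
acc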

Next, let's train a random forest model:

m = rf(t_xs, t_y, min_samples_leaf=10, n_estimators=120)
m_auc(m, t_xs, t_y), m_auc(m, v_xs, v_y)
(0.9765178460830635, 0.9401429422275164)
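Since rf turns on oob_score, we also get an out-of-bag estimate for free: each training row is predicted only by the trees that never sampled it. I didn't record it in this run, but it's a useful cross-check that the validation split isn't misleading:

# Out-of-bag R^2 on the training data, computed only from trees that
# didn't sample each row (value not recorded in the original run).
m.oob_score_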

And then, we'll train a neural network:

dls = tp.dataloaders(128)

Again, we're dropping the duplicated columns from TabularPandas:

dls.train.xs = dls.train.xs.drop(columns=dls.train.x_names[0:100])
dls.valid.xs = dls.valid.xs.drop(columns=dls.valid.x_names[0:100])

learn = tabular_learner(dls, y_range=(0, 1.1), n_out=1)
lr = learn.lr_find().valley
learn.fit_one_cycle(20, lr)

epoch train_loss valid_loss time
0 0.289919 0.264292 00:00
1 0.280569 0.260043 00:00
2 0.264891 0.248970 00:00
3 0.243126 0.230470 00:00
4 0.218251 0.207044 00:00
5 0.195376 0.184551 00:00
6 0.175897 0.165676 00:00
7 0.158557 0.149466 00:00
8 0.143941 0.136603 00:00
9 0.131354 0.127658 00:00
10 0.119663 0.120048 00:00
11 0.110008 0.114571 00:00
12 0.101210 0.111422 00:00
13 0.093147 0.109775 00:00
14 0.086211 0.108900 00:00
15 0.080164 0.107172 00:00
16 0.074857 0.106758 00:00
17 0.069664 0.106035 00:00
18 0.065428 0.105261 00:00
19 0.061637 0.104235 00:00
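A quick aside on that y_range=(0, 1.1): fastai squashes the network's final activation with its sigmoid_range function so every prediction lands in that interval, and the extra 0.1 of headroom lets the model actually reach 1.0, since a sigmoid only approaches its endpoints asymptotically. A minimal sketch with made-up activations:

import torch
from fastai.layers import sigmoid_range

x = torch.tensor([-3.0, 0.0, 3.0])  # hypothetical raw activations
sigmoid_range(x, 0, 1.1)            # each value squashed into (0, 1.1)

Back to the trained model, we grab its validation predictions and score them: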

preds, targs = learn.get_preds()
a_uc(preds, targs)
0.9252531268612271

Finally, we ensemble the predictions from our baseline decision tree (since its ROC AUC isn't far off from the neural network's), the random forest, and the neural network:

rf_preds = m.predict(v_xs)
ens_preds = (to_np(preds.squeeze()) + rf_preds + dt.predict(v_xs)) / 3
a_uc(ens_preds, v_y)
0.953543776057177
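A simple average weights all three models equally. One variation I could try (hypothetical; I didn't run this) is leaning more on the two stronger models and less on the single decision tree:

# Hypothetical weighted ensemble: weights sum to 1, with less weight
# on the decision tree since it's the weakest of the three.
w_preds = (0.45 * rf_preds
           + 0.45 * to_np(preds.squeeze())
           + 0.10 * dt.predict(v_xs))
a_uc(w_preds, v_y)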

From a fairly quick bit of modeling with no really complex components added (the kind I'd probably learn in the second part of fast.ai), we're able to come 36th on the leaderboard (which is top 75%).

Anyway, this post is just a quick recap of tabular model training using a relatively small data set. We didn't need to preprocess the data since TabularPandas handles the missing values for us. We trained a baseline, then a random forest and a neural network, and then ensembled the predictions. Perhaps I could have analyzed the columns and removed some of the unimportant features. Unlike last time, there aren't any categorical columns, so I didn't use the embeddings to train another random forest model.

Maybe in the future I'll revisit this data set and aim to get a higher score.