Hello! I wanted to move on to natural language processing by next week, but I also wanted some more practice with tabular data sets. And what better way than to use a data set that's related to natural language: emails.

This Kaggle competition ran 9 years ago and is one of the "getting started" data sets. This data set contains "100 features extracted from a corpus of emails. Some of the emails are spam and some are normal. [Our] task is to make a spam detector."

I've already downloaded the data set, so we can import pandas and read in the files:

import pandas as pd

xs   = pd.read_csv('train.csv', low_memory=False)
y    = pd.read_csv('train_labels.csv', low_memory=False)
test = pd.read_csv('test.csv', low_memory=False)
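Before going further, let's take a quick peek at what we loaded (a small check I'm adding here; I didn't note down the exact output):

# Quick look at the data: the competition description says there are
# 100 extracted features, plus a separate file with the labels.
xs.shape, y.shape, test.shape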

Then, we'll define functions for our random forest trainer and for the metric that this competition requires, which is the area under the ROC curve (AUC).

from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor

def rf(xs, y, n_estimators=40, min_samples_leaf=5, max_samples=300,
       max_features=0.5):
    return RandomForestRegressor(n_estimators=n_estimators,
                                 min_samples_leaf=min_samples_leaf,
                                 max_samples=max_samples,
                                 max_features=max_features,
                                 oob_score=True, n_jobs=-1).fit(xs, y)

def a_uc(preds, y):
    # area under the ROC curve, from the false/true positive rates
    fpr, tpr, thresholds = metrics.roc_curve(y, preds, pos_label=1)
    return metrics.auc(fpr, tpr)

def m_auc(m, xs, y):
    # AUC of a fitted model's predictions on the given data
    return a_uc(m.predict(xs), y)
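As a quick sanity check of a_uc (my addition here, not part of the original run), a perfect ranking, where every spam email scores above every normal one, should give an AUC of exactly 1.0:

# Hypothetical toy example: positives (label 1) scored 0.9 and 0.8,
# negatives (label 0) scored 0.1 and 0.2, so the ranking is perfect.
a_uc([0.1, 0.9, 0.8, 0.2], [0, 1, 1, 0])  # -> 1.0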

In making our TabularPandas, we'll merge the independent and dependent variables. But unlike before, we'll use a randomized split, since this isn't a time series.

from fastai.tabular.all import *

df_merged = pd.concat((xs, y), axis=1).copy()
procs = [Categorify, FillMissing, Normalize]
cont, cat = cont_cat_split(df_merged, dep_var='0')
tp = TabularPandas(df_merged, procs, cat_names=cat, cont_names=cont,
                   y_names='0', splits=RandomSplitter()(xs))
t_xs, v_xs, t_y, v_y = tp.train.xs, tp.valid.xs, tp.train.y, tp.valid.y
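One aside: RandomSplitter shuffles differently on every run. If I wanted the validation scores below to be reproducible, I could pass a seed (hypothetical here; the original run didn't):

# Hypothetical: pin the shuffle so the 80/20 train/valid split is
# identical across runs (the seed value is arbitrary).
splits = RandomSplitter(valid_pct=0.2, seed=42)(xs)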

I'm not sure how to use TabularPandas that well since there are no categorical columns in this data set, but it seems like TabularPandas requires them. If there aren't any, it duplicates all the columns, so we remove the duplicates here:

# the first 100 x_names are the duplicated copies, so drop them
t_xs = t_xs.drop(tp.train.x_names[0:100], axis=1)
v_xs = v_xs.drop(tp.valid.x_names[0:100], axis=1)
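To make sure the drop did what we expect (a check I'm adding here, with no recorded output), we can confirm that the train and valid feature matrices are back down to the original 100 columns:

# Hypothetical check: both splits should now have 100 feature columns
t_xs.shape, v_xs.shape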

Then, we train a decision tree as a baseline:

from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor(min_samples_leaf=40).fit(t_xs, t_y)
m_auc(dt, t_xs, t_y), m_auc(dt, v_xs, v_y)
(0.9117870857001292, 0.9124478856462179)

Surprisingly, the baseline is already at 0.912. Note that ROC AUC isn't equivalent to accuracy: it measures how well the model ranks spam above normal emails, not the fraction of correct predictions. So rather than saying the model is 91.2% accurate, I'll just refer to the score as 0.912.
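If I did want an accuracy figure to compare against (hypothetical, assuming the labels are encoded as 0 and 1, which the y_range used further below suggests), I could threshold the regressor's scores at 0.5:

# Hypothetical: turn continuous scores into hard 0/1 predictions by
# thresholding at 0.5, then take the fraction that match the labels.
acc = ((dt.predict(v_xs) > 0.5).astype(int) == v_y).mean()
acc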

Next, let's train a random forest model:

m = rf(t_xs, t_y, min_samples_leaf=10, n_estimators=120)
m_auc(m, t_xs, t_y), m_auc(m, v_xs, v_y)
(0.9765178460830635, 0.9401429422275164)
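Since rf turns on oob_score, we also get an out-of-bag estimate for free: each training row is predicted only by the trees that never sampled it. I didn't record it in this run, but it's a useful cross-check that the validation split isn't misleading:

# Out-of-bag R^2 on the training data, computed only from trees that
# didn't sample each row (value not recorded in the original run).
m.oob_score_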

And then, we'll train a neural network:

dls = tp.dataloaders(128)

Again, we're dropping the duplicated columns from TabularPandas:

dls.train.xs = dls.train.xs.drop(columns=dls.train.x_names[0:100])
dls.valid.xs = dls.valid.xs.drop(columns=dls.valid.x_names[0:100])

learn = tabular_learner(dls, y_range=(0, 1.1), n_out=1)
lr = learn.lr_find().valley
learn.fit_one_cycle(20, lr)

epoch train_loss valid_loss time
0 0.289919 0.264292 00:00
1 0.280569 0.260043 00:00
2 0.264891 0.248970 00:00
3 0.243126 0.230470 00:00
4 0.218251 0.207044 00:00
5 0.195376 0.184551 00:00
6 0.175897 0.165676 00:00
7 0.158557 0.149466 00:00
8 0.143941 0.136603 00:00
9 0.131354 0.127658 00:00
10 0.119663 0.120048 00:00
11 0.110008 0.114571 00:00
12 0.101210 0.111422 00:00
13 0.093147 0.109775 00:00
14 0.086211 0.108900 00:00
15 0.080164 0.107172 00:00
16 0.074857 0.106758 00:00
17 0.069664 0.106035 00:00
18 0.065428 0.105261 00:00
19 0.061637 0.104235 00:00
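A quick aside on that y_range=(0, 1.1): fastai squashes the network's final activation with its sigmoid_range function so every prediction lands in that interval, and the extra 0.1 of headroom lets the model actually reach 1.0, since a sigmoid only approaches its endpoints asymptotically. A minimal sketch with made-up activations:

import torch
from fastai.layers import sigmoid_range

x = torch.tensor([-3.0, 0.0, 3.0])  # hypothetical raw activations
sigmoid_range(x, 0, 1.1)            # each value squashed into (0, 1.1)

Back to the trained model, we grab its validation predictions and score them: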

preds, targs = learn.get_preds()
a_uc(preds, targs)
0.9252531268612271

Finally, we ensemble the predictions from our baseline decision tree (since its ROC AUC isn't far off from the neural network's), the random forest, and the neural network:

rf_preds = m.predict(v_xs)
ens_preds = (to_np(preds.squeeze()) + rf_preds + dt.predict(v_xs)) / 3
a_uc(ens_preds, v_y)
0.953543776057177
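A simple average weights all three models equally. One variation I could try (hypothetical; I didn't run this) is leaning more on the two stronger models and less on the single decision tree:

# Hypothetical weighted ensemble: weights sum to 1, with less weight
# on the decision tree since it's the weakest of the three.
w_preds = (0.45 * rf_preds
           + 0.45 * to_np(preds.squeeze())
           + 0.10 * dt.predict(v_xs))
a_uc(w_preds, v_y)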

From a fairly quick bit of modeling with no really complex components added (the kind I'd probably learn in the second part of fast.ai), we're able to come 36th on the leaderboard (which is top 75%).

Anyway, this post is just a quick recap of tabular model training using a relatively small data set. We didn't need to preprocess the data since TabularPandas handles the missing values for us. We trained a baseline, then a random forest and a neural network, and then ensembled the predictions. Perhaps I could have analyzed the columns and removed some of the unimportant features. Unlike last time, there aren't any categorical columns, so I didn't use the embeddings to train another random forest model.

Maybe in the future I'll revisit this data set and aim to get a higher score.